Closed Eduardo-Auer closed 2 years ago
Hi,
Sorry for the late reply, and thank you for your feedback. To help solve your issue, could you share your input GC file with me? It looks like the GC file (the GC table file with GC content) has some problems. You can email it to me at lijj36@ucla.edu
Best regards, Albert
Hi,
I have received your file. Thank you very much. Have you updated the software to its latest version? Could you run
conda update forestqc -c avallonking
to update ForestQC and then try again to see whether this error still occurs? I came across a similar issue before and fixed it in the latest version of ForestQC.
Best regards, Albert
Hello,
Thank you very much for the advice! It worked! However, I got another error in the "ForestQC classify" step:
ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading data...
/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py:54: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
grey_variants.fillna(grey_variants.median()[_features], inplace=True)
Traceback (most recent call last):
  File "/root/miniconda3/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.4', 'console_scripts', 'ForestQC')())
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 186, in main_classify
    execute_classification(good_var, bad_var, gray_var, model, output_suffix, features, prob_threshold)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 113, in execute_classification
    pred, prob = classification(good, bad, grey, model, user_features, threshold)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 94, in classification
    pred, prob = rf_model[model](pd.concat([good.sample(n=bad.shape[0], random_state=9), bad]), grey, user_features,
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 57, in random_forest_classifierB
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
  File "/root/miniconda3/lib/python3.9/site-packages/sklearn/model_selection/_split.py", line 2422, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/root/miniconda3/lib/python3.9/site-packages/sklearn/model_selection/_split.py", line 2098, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Might the cause be the lack of bad variants found in the previous step (ForestQC split)?
ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest
Loading data...
Data processing...
Current filter settings:
Good variants
----------------
Mendel_Error <= 0.04478
Missing_Rate < 0.005
HWE > 0.01
0.3 <= ABHet_deviation <= 0.7
Bad variants
----------------
Rare variants (MAF < 0.03):
Mendel_Error > 0.04478
Missing_Rate > 0.02
HWE < 0.005
ABHet_deviation > 0.25
Common variants (MAF >= 0.03):
Mendel_Error > 0.07463
Missing_Rate > 0.03
HWE < 0.0005
ABHet_deviation > 0.25
Outlier variants
----------------
Rare variants (MAF < 0.03):
Mendel_Error > 0.1194
Missing_Rate > 0.08
HWE < 0.001
Common variants (MAF >= 0.03):
Mendel_Error > 0.14925
Missing_Rate > 0.1
HWE < 1e-08
Number of variants
Good variants: 2786346
Bad variants: 0
Grey variants: 2363824
Writing data...
Done.
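Since the question above hinges on the "Bad variants" count, one way to catch this before running "ForestQC classify" is to read the counts out of the "ForestQC split" output. The following is only a rough sketch: it assumes the summary lines look exactly like the ones printed above, and the names parse_split_counts and can_train are hypothetical helpers, not part of ForestQC.

```python
import re

# Matches summary lines such as "Bad variants: 0" from the split output above.
# Section headers like "Bad variants" (no colon, no number) do not match.
COUNT_LINE = re.compile(r"^(Good|Bad|Grey) variants:\s*(\d+)\s*$")


def parse_split_counts(log_text):
    """Return a dict like {'good': n, 'bad': n, 'grey': n} from the log text."""
    counts = {}
    for line in log_text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1).lower()] = int(match.group(2))
    return counts


def can_train(counts):
    """True only if both training classes (good and bad) are non-empty."""
    return counts.get("good", 0) > 0 and counts.get("bad", 0) > 0


# Hypothetical usage with the tail of the log shown above.
log = """Number of variants
Good variants: 2786346
Bad variants: 0
Grey variants: 2363824
"""
counts = parse_split_counts(log)
print(counts, "trainable:", can_train(counts))
```

If can_train returns False, the classify step would presumably fail in the same way as in the traceback, so the split thresholds or the input statistics need to be revisited first.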
Hi,
Yes, the reason is that there are no bad variants in the "ForestQC split" step, but the "ForestQC classify" step needs bad variants to train the model. This is related to the cutoff values in the "ForestQC split" step and the statistics calculated from the "ForestQC stat" step.
How many samples do you have in your VCF file? To solve this issue, I need to take a look at your input. Can you email me your input VCF file, or alternatively the output file of "ForestQC stat"?
Best regards, Albert
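To illustrate why zero bad variants leads to the ValueError in the traceback, here is a minimal sketch in plain Python. It relies on two assumptions: the training set is built as shown in the traceback line (good variants down-sampled to the number of bad variants, then concatenated with the bad variants), and the size check is a simplified stand-in for what scikit-learn does with a float test_size. The function names training_rows and split_sizes are hypothetical.

```python
import math


def training_rows(n_bad):
    # Per the traceback: good.sample(n=bad.shape[0]) concatenated with bad,
    # so the training table has n_bad good rows plus n_bad bad rows.
    return n_bad + n_bad


def split_sizes(n_samples, test_size=0.20):
    """Simplified stand-in for the train/test size check (float test_size only)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    if n_train == 0:
        raise ValueError(
            f"With n_samples={n_samples}, test_size={test_size}, "
            "the resulting train set will be empty."
        )
    return n_train, n_test


# With a few bad variants the split works; with none it cannot.
print(split_sizes(training_rows(5)))
try:
    split_sizes(training_rows(0))
except ValueError as err:
    print("failed:", err)
```

Under these assumptions, the number of good variants does not matter: the training set size is driven entirely by the number of bad variants, so "Bad variants: 0" leaves nothing to train on.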
Hi,
I have a single sample in my VCF (these are my own genome's variants, so I have no problem sharing the file). I will send an email with the input file and the output file of "ForestQC stat".
I am deeply grateful to everyone who helped me! It worked successfully on the single-sample VCFs that I tested.
Best regards, Eduardo.
Hello, I noticed an error when using ForestQC stat:
Is there any solution for this?