avallonking / ForestQC

Quality control on genetic variants from next-generation sequencing data using random forest
MIT License

ForestQC stat error #7

Closed: Eduardo-Auer closed this issue 2 years ago

Eduardo-Auer commented 2 years ago

Hello, I noticed an error when using ForestQC stat:

ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading files...
Computing...
Traceback (most recent call last):
  File "/root/miniconda3/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.4', 'console_scripts', 'ForestQC')())
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 129, in main_stat
    vcf_process(target_file, stat_file, gc_file, ped_file, discord_geno_dict, hwe_file, gender_file, dp, gq,
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/stat.py", line 63, in vcf_process
    gc = getGC(pos, gc_table_by_chr[chr])
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/vcf_stat.py", line 99, in getGC
    step = gc_table.iloc[2,1] - gc_table.iloc[1,1]
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 925, in __getitem__
    return self._getitem_tuple(key)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1506, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 754, in _has_valid_tuple
    self._validate_key(k, i)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1409, in _validate_key
    self._validate_integer(key, axis)
  File "/root/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

Is there any solution for this?

avallonking commented 2 years ago

Hi,

Sorry for the late reply, and thank you for your feedback. To help me track this down, could you share your input GC file with me? It looks like the GC file (the GC table file with GC content) has a problem. You can email it to me at lijj36@ucla.edu
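
For anyone hitting the same traceback: the failing line, gc_table.iloc[2,1] - gc_table.iloc[1,1], reads the second and third rows of one chromosome's GC table, so one way this IndexError can arise is a chromosome with fewer than three rows in the GC file. The snippet below is only a minimal sketch for spotting such chromosomes before running "ForestQC stat". It assumes a tab-separated file with the chromosome name in the first column; that layout, the function name and the example data are all hypothetical and are not taken from ForestQC.

```python
import csv
import io
from collections import Counter


def chromosomes_with_short_tables(gc_text, min_rows=3):
    """Return chromosomes that have fewer than min_rows rows in the GC table.

    Assumption: tab-separated text with the chromosome name in column 0.
    This is a guess at the layout, not ForestQC's documented format.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(gc_text), delimiter="\t"):
        # Skip empty lines and comment/header lines starting with '#'.
        if row and not row[0].startswith("#"):
            counts[row[0]] += 1
    return sorted(chrom for chrom, n in counts.items() if n < min_rows)


# Hypothetical data: chr1 has three windows, chrM has only one.
example = (
    "chr1\t0\t0.41\n"
    "chr1\t1000\t0.39\n"
    "chr1\t2000\t0.44\n"
    "chrM\t0\t0.45\n"
)
print(chromosomes_with_short_tables(example))  # ['chrM']
```

Any chromosome this reports would be a candidate for the out-of-bounds access, though other problems in the file (for example, a table with fewer than two columns) could trigger the same message.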

Best regards, Albert

avallonking commented 2 years ago

Hi,

I have received your file, thank you very much. Have you updated the software to the latest version? Could you run conda update forestqc -c avallonking to update ForestQC, then try again to see whether the error persists? I came across a similar issue before and fixed it in the latest version of ForestQC.

Best regards, Albert

Eduardo-Auer commented 2 years ago

Hello,

Thank you very much for the advice! It worked! However, I got another error in the "ForestQC classify" step:

ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading data...
/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py:54: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  grey_variants.fillna(grey_variants.median()[_features], inplace=True)
Traceback (most recent call last):
  File "/root/miniconda3/bin/ForestQC", line 33, in <module>
    sys.exit(load_entry_point('ForestQC==1.1.5.4', 'console_scripts', 'ForestQC')())
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 201, in main
    command_functions[command](**args)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/__main__.py", line 186, in main_classify
    execute_classification(good_var, bad_var, gray_var, model, output_suffix, features, prob_threshold)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 113, in execute_classification
    pred, prob = classification(good, bad, grey, model, user_features, threshold)
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 94, in classification
    pred, prob = rf_model[model](pd.concat([good.sample(n=bad.shape[0], random_state=9), bad]), grey, user_features,
  File "/root/miniconda3/lib/python3.9/site-packages/ForestQC/classification.py", line 57, in random_forest_classifierB
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
  File "/root/miniconda3/lib/python3.9/site-packages/sklearn/model_selection/_split.py", line 2422, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/root/miniconda3/lib/python3.9/site-packages/sklearn/model_selection/_split.py", line 2098, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

Might the cause be the lack of bad variants found in the previous step (ForestQC split)?

ForestQC v1.1.5.4 by Jae Hoon Sul Lab at UCLA
--Quality control on genetic variants from next-generation sequencing data using random forest

Loading data...
Data processing...

Current filter settings:

Good variants
----------------
Mendel_Error <= 0.04478
Missing_Rate < 0.005
HWE > 0.01
0.3 <= ABHet_deviation <= 0.7

Bad variants
----------------
Rare variants (MAF < 0.03):
        Mendel_Error > 0.04478
        Missing_Rate > 0.02
        HWE < 0.005
        ABHet_deviation > 0.25

Common variants (MAF >= 0.03):
        Mendel_Error > 0.07463
        Missing_Rate > 0.03
        HWE < 0.0005
        ABHet_deviation > 0.25

Outlier variants
----------------
Rare variants (MAF < 0.03):
        Mendel_Error > 0.1194
        Missing_Rate > 0.08
        HWE < 0.001

Common variants (MAF >= 0.03):
        Mendel_Error > 0.14925
        Missing_Rate > 0.1
        HWE < 1e-08

Number of variants
Good variants: 2786346
Bad variants: 0
Grey variants: 2363824

Writing data...
Done.

avallonking commented 2 years ago

Hi,

Yes, the reason is that there are no bad variants in the "ForestQC split" step, but the "ForestQC classify" step needs bad variants to train the model. This is related to the cutoff values in the "ForestQC split" step and the statistics calculated from the "ForestQC stat" step.
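
To make the arithmetic concrete: the traceback shows the training set being built as good.sample(n=bad.shape[0], ...) concatenated with bad, so its size is twice the number of bad variants, and with 0 bad variants it is empty. Below is a minimal sketch of a pre-flight check along those lines. The function names are hypothetical and this is not ForestQC code; it only mirrors the sampling pattern visible in the traceback.

```python
def balanced_training_rows(n_good, n_bad):
    """Rows in a training set made of n_bad sampled good variants plus all
    bad variants (the pattern visible in the traceback)."""
    if n_bad > n_good:
        # Sampling without replacement cannot draw more rows than exist.
        raise ValueError("more bad variants than good variants to sample from")
    return 2 * n_bad


def check_can_train(n_good, n_bad):
    """Raise a readable error before training if the set would be empty."""
    rows = balanced_training_rows(n_good, n_bad)
    if rows == 0:
        raise ValueError(
            "no bad variants were produced by the split step, "
            "so the training set would be empty"
        )
    return rows


# Counts reported by "ForestQC split" in this thread.
try:
    check_can_train(n_good=2786346, n_bad=0)
except ValueError as err:
    print("Cannot train:", err)
```

A check like this would turn the n_samples=0 error from scikit-learn into a message that points back at the split step.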

My question is: How many samples do you have in your VCF file? To solve this issue, I need to take a look at your input file. Can you email me your input VCF file? Or you can send me the output file of "ForestQC stat".

Best regards, Albert

Eduardo-Auer commented 2 years ago

Hi,

I have a single sample in my VCF (these are variants from my own genome, so I have no problem sharing the VCF). I will email you the input file and the output file of "ForestQC stat".

Eduardo-Auer commented 2 years ago

I am deeply grateful to everyone who helped me! It worked successfully on the single-sample VCFs that I tested.

Best regards, Eduardo.