Error when all variants get filtered from gnomAD blacklist / empty VCF is given

arine commented 2 weeks ago

Describe the bug When all variants are filtered out (I believe an empty VCF does the same), the score file under the "rami-test" directory ends up empty, which produces the error below:

Traceback (most recent call last):
  File "/run/generate_new_matrix_2.py", line 17, in <module>
    score=pd.read_csv('/out/rami-test/%s_scores.csv'%(sys.argv[1]))
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Expected behavior The pipeline should finish without error and produce an output file that either 1) contains one content line per input variant, possibly with an AIM score of 0, or 2) contains only the header line.

Input data demo_short.vcf.zip

Genome build hg19

jylee-bcm commented 2 weeks ago

I've identified the cause of the issue and would like to share my findings before proceeding with the bug fix. The primary problem, as you noted, is that all the variants were filtered out due to the blacklist, as seen here: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L140

Consequently, /run/annotation/main.py fails to run and cannot create the output file rami_test/$1_scores.csv: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L200-L210

The specific error occurs on the following line, and there are additional errors when there are no rows in the VEP input: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/annotation/main.py#L391

Additionally, the script attempts to execute $ mv a > b when neither a nor b exists, resulting in an error and the creation of an empty file b: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L218
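To illustrate that redirect behavior in isolation, here is a minimal sketch, assuming a POSIX shell and a standard mv are available. The file names a and b are placeholders, not the real paths used in proc.sh.

```python
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Neither "a" nor "b" exists in the fresh temporary directory.
    result = subprocess.run(
        "mv a > b", shell=True, cwd=tmp, capture_output=True, text=True
    )
    # mv receives a single operand, so it exits with a non-zero status.
    mv_failed = result.returncode != 0
    # The shell opens "b" for the redirect before mv runs, so "b" now exists.
    b_path = os.path.join(tmp, "b")
    b_exists = os.path.exists(b_path)
    # mv reports its error on stderr, so nothing is written to "b".
    b_size = os.path.getsize(b_path) if b_exists else None
```

The shell creates b before mv starts, so the empty file remains even though mv itself fails.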

This causes generate_new_matrix_2.py to fail when executing read_csv(), which expects the input file to have at least one row containing column names.
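The difference between a zero-byte file and a header-only file can be reproduced with a small sketch like the one below. The helper read_scores and the column names are made up for illustration and are not taken from generate_new_matrix_2.py.

```python
import os
import tempfile

import pandas as pd


def read_scores(path):
    """Return the scores DataFrame, or None if the file has nothing to parse."""
    try:
        return pd.read_csv(path)
    except pd.errors.EmptyDataError:
        # Raised for a file with no columns to parse, e.g. a zero-byte file.
        return None


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty_scores.csv")
    header_path = os.path.join(tmp, "header_only_scores.csv")

    # Zero-byte file, like the one left behind by the failed step.
    open(empty_path, "w").close()

    # Header-only file; these column names are hypothetical.
    with open(header_path, "w") as fh:
        fh.write("variant_id,score\n")

    empty_result = read_scores(empty_path)    # None: EmptyDataError was caught
    header_result = read_scores(header_path)  # DataFrame with columns, no rows
```

A header-only file parses into a DataFrame with the expected columns and zero rows, whereas the zero-byte file raises the EmptyDataError shown in the traceback.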

One possible solution is to modify /run/annotation/main.py to handle empty VCF files, resulting in a valid empty CSV file with only the column names in the first row. However, since /run/annotation/main.py involves multiprocessing, this adds complexity to the code analysis and makes it harder to quickly fix the bug. Now that I understand the problem, I will prioritize making the necessary changes to ensure the pipeline functions as expected.
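As a rough sketch of what "a valid empty CSV file with only the column names" could look like, assuming the writer knows its column names up front: the function write_scores and the columns below are hypothetical and do not reflect how main.py actually writes its output.

```python
import csv
import os
import tempfile

# Hypothetical column names; the real score file's columns are not shown here.
SCORE_COLUMNS = ["variant_id", "score"]


def write_scores(path, rows, columns=SCORE_COLUMNS):
    """Write rows to a CSV file, always emitting the header row first."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)  # writes nothing when rows is empty


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample_scores.csv")
    # No variants survived filtering, so there are no rows to write.
    write_scores(out_path, rows=[])
    with open(out_path, newline="") as fh:
        written_lines = fh.read().splitlines()
```

With no rows, the file still contains the single header line, which pandas can read as an empty DataFrame.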

jylee-bcm commented 1 week ago

I just tested changing run/annotation/main.py to create an empty CSV file containing only the column names. The next R script, /VarTierDiseaseDBFalse.R, then ran and hit another error because of the empty CSV file.

I don't think creating an empty output file whenever an intermediate file is invalid can be a solution. Instead, I'd like to suggest making the pipeline stop as soon as it detects that it cannot finish the workflow, and emit a well-organized error message and error code.

The frontend could then use the error code to show users the most relevant information for fixing their input.

For instance, with demo_short.vcf, the error code might be ERROR: VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST, and the user could then be given resources about the blacklist we use and why we use it. I'm not sure whether it's possible or practical, but for user convenience we could also add an input option to skip the blacklist filtering step.
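A minimal sketch of such a check, assuming it runs on the VCF right after the blacklist step: the function names and the numeric exit status are made up for illustration, and only the error name comes from the suggestion above.

```python
import gzip
import os
import sys
import tempfile

# Error name from the suggestion above; the numeric exit status is hypothetical.
VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST = 3


def count_vcf_records(path):
    """Count non-header, non-blank lines in a plain or gzipped VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for line in fh if line.strip() and not line.startswith("#"))


def check_filtered_vcf(path):
    """Return 0 if the VCF still has records, else the error's exit status."""
    if count_vcf_records(path) == 0:
        print("ERROR: VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST", file=sys.stderr)
        return VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST
    return 0


with tempfile.TemporaryDirectory() as tmp:
    header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"

    # A VCF whose records were all filtered out: only header lines remain.
    empty_vcf = os.path.join(tmp, "filtered_empty.vcf")
    with open(empty_vcf, "w") as fh:
        fh.write(header)

    # A VCF that still has one record after filtering.
    ok_vcf = os.path.join(tmp, "filtered_ok.vcf")
    with open(ok_vcf, "w") as fh:
        fh.write(header + "1\t12345\t.\tA\tG\t.\tPASS\t.\n")

    empty_status = check_filtered_vcf(empty_vcf)
    ok_status = check_filtered_vcf(ok_vcf)
```

In the pipeline, something like sys.exit(check_filtered_vcf(path)) could stop the run early, and the frontend could map the exit status or the message on stderr to user-facing guidance.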

Do you think the suggestion would work?

LiuzLab / AI_MARRVEL

Error when all variants get filtered from gnomAD blacklist / empty VCF is given #19