Open arine opened 2 weeks ago
I've identified the cause of the issue and would like to share my findings before proceeding with the bug fix. The primary problem, as you noted, is that all the variants were filtered out due to the blacklist, as seen here: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L140
Consequently, /run/annotation/main.py fails to run and cannot create the output file rami_test/$1_scores.csv:
https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L200-L210
The specific error occurs on the following line, and there are additional errors when there are no rows in the VEP input: https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/annotation/main.py#L391
Additionally, the script attempts to execute $ mv a > b when neither a nor b exists, resulting in an error and the creation of an empty file b:
https://github.com/LiuzLab/AI_MARRVEL/blob/162ae44442f5cae1cc749fd6740d9fa07c678ab7/run/proc.sh#L218
This causes generate_new_matrix_2.py to fail when executing read_csv(), which expects the input file to have at least one row containing column names.
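The redirect behaviour described above can be reproduced in isolation. The sketch below is only an illustration, not code from the pipeline: the helper name `run_mv_with_redirect` is hypothetical, and it assumes a POSIX shell is available. It runs `mv a > b` in an empty temporary directory and shows that the command fails while a zero-byte `b` is still left behind, because the shell creates the redirect target before `mv` starts.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple


def run_mv_with_redirect(workdir: Path) -> Tuple[int, int]:
    """Run `mv a > b` in workdir; return (exit code, size of b in bytes)."""
    # The shell opens (creates or truncates) b for the redirect first,
    # so b exists even though mv then fails because a is missing.
    result = subprocess.run(
        "mv a > b",
        shell=True,
        cwd=workdir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode, (workdir / "b").stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    exit_code, size = run_mv_with_redirect(Path(tmp))
    print("mv failed:", exit_code != 0, "| size of b:", size)
```

A zero-byte file like this has no header row, which is the situation read_csv() cannot handle.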
One possible solution is to modify /run/annotation/main.py to handle empty VCF files, so that it produces a valid empty CSV file with only the column names in the first row. However, since /run/annotation/main.py involves multiprocessing, the code is harder to analyze and the bug is harder to fix quickly. Now that I understand the problem, I will prioritize making the necessary changes to ensure the pipeline functions as expected.
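As a rough sketch of what "a valid empty CSV file with only the column names" could look like, the snippet below always writes a header row, even when there are no rows to write. The column names and the `write_scores` helper are hypothetical placeholders, not the columns or functions that main.py actually uses.

```python
import csv
import io

# Hypothetical column names; the real ones come from the annotation step.
COLUMNS = ["varId", "geneSymbol", "score"]


def write_scores(rows, out):
    """Write rows as CSV to the file-like object out, header first."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()  # emitted even when rows is empty
    writer.writerows(rows)


# With no annotated variants, the output still contains the header line.
buf = io.StringIO()
write_scores([], buf)
print(repr(buf.getvalue()))
```

A file written this way parses as a table with zero rows rather than as an empty file, although later steps still have to cope with zero rows.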
I just tested this by changing run/annotation/main.py to create an empty CSV file containing only the column names. The next R script, /VarTierDiseaseDBFalse.R, then ran and failed with another error because of the empty CSV file.
I don't think creating an empty output file whenever an intermediate file becomes invalid is a workable solution. Instead, I'd like to suggest making the pipeline stop as soon as it detects that it cannot finish the workflow, and emit an organized error message and error code.
The frontend can then use the error code to show users the most relevant information for fixing their input.
For instance, with demo_short.vcf, the error code might be ERROR: VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST, and the user can then be pointed to resources about the blacklist we are using and why we use it. I'm not sure whether it's possible or practical, but for user convenience we could also offer an input option to skip the blacklist filtering step.
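To make the proposal concrete, here is a minimal sketch of such a fail-fast check, assuming it would run right after the blacklist filtering step. The `PipelineError` enum, the helper names, and the numeric exit code 3 are all hypothetical; only the error name VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST comes from the suggestion above.

```python
import sys
from enum import Enum
from typing import Iterable


class PipelineError(Enum):
    # Hypothetical numeric code; only the name is taken from the proposal.
    VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST = 3


def count_variant_records(lines: Iterable[str]) -> int:
    """Count VCF data lines: non-blank lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def check_filtered_vcf(lines: Iterable[str]) -> None:
    """Stop the pipeline with a named error if no variants survived filtering."""
    if count_variant_records(lines) == 0:
        err = PipelineError.VCF_IS_EMPTY_AFTER_REMOVING_BLACKLIST
        print(f"ERROR: {err.name}", file=sys.stderr)
        raise SystemExit(err.value)


# A filtered VCF that still has one record passes the check silently.
example = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\t.\tA\tG\n"]
check_filtered_vcf(example)
print(count_variant_records(example))
```

The calling script could then map the exit code to a user-facing message in the frontend, instead of letting later steps fail on a missing or empty file.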
Do you think the suggestion would work?
Describe the bug
When all variants get filtered out (I believe an empty VCF will do the same), the score file under the "rami-test" directory becomes empty. This case then produces the error below:
Expected behavior
The pipeline should finish without error and produce an output file that either 1) contains the same number of content lines as there are input variants, possibly with an AIM score of 0, or 2) contains only the header line.
Input data
demo_short.vcf.zip
Genome build
hg19