VCF preprocessing requires too strict format

LiuzLab / AI_MARRVEL

AI-MARRVEL (AIM) is an AI system for rare genetic disorder diagnosis

GNU General Public License v3.0


VCF preprocessing requires too strict format #21

Open arine opened 1 week ago

arine commented 1 week ago

Is your feature request related to a problem? Please describe.

VCF preprocessing, currently done by bcftools, requires the ##FILTER, ##FORMAT, ##INFO, and ##contig header lines to be present, which is too strict.

Describe the solution you'd like

Ideally, preprocessing should need only CHROM, POS, REF, and ALT (with FILTER optional). Here is an example VCF that should pass through the pipeline without error: demo_sloppy.vcf.zip
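One possible way to tolerate such input is to synthesize the missing header lines before the file reaches bcftools. The sketch below is only an illustration, not the pipeline's actual behaviour: the function name `add_minimal_header` is hypothetical, it assumes the input still has the `#CHROM` column line and the standard eight tab-separated columns, and whether every bcftools step accepts `##contig` lines without a length has not been verified here.

```python
import re

# Matches declarations such as ##contig=<ID=1,...> or ##FILTER=<ID=LowQual,...>
_DECL = re.compile(r"^##(contig|FILTER)=<ID=([^,>]+)")


def add_minimal_header(vcf_text: str) -> str:
    """Return vcf_text with ##fileformat, ##contig and ##FILTER lines added if absent."""
    lines = vcf_text.splitlines()
    meta = [ln for ln in lines if ln.startswith("##")]
    column_line = next((ln for ln in lines if ln.startswith("#CHROM")), None)
    if column_line is None:
        raise ValueError("missing #CHROM column header line")
    records = [ln for ln in lines if ln and not ln.startswith("#")]

    # Collect what the header already declares.
    declared = {"contig": set(), "FILTER": set()}
    for ln in meta:
        m = _DECL.match(ln)
        if m:
            declared[m.group(1)].add(m.group(2))

    # Collect contigs and FILTER names actually used by the records.
    contigs, filters = [], []
    for rec in records:
        fields = rec.split("\t")
        if fields[0] not in contigs:
            contigs.append(fields[0])
        if len(fields) > 6:  # FILTER is the 7th column
            for name in fields[6].split(";"):
                if name not in (".", "PASS", "") and name not in filters:
                    filters.append(name)

    # ##fileformat must be the first line of a VCF file.
    fileformat = [ln for ln in meta if ln.startswith("##fileformat=")]
    fileformat = fileformat[:1] or ["##fileformat=VCFv4.2"]
    other = [ln for ln in meta if not ln.startswith("##fileformat=")]

    added = [f"##contig=<ID={c}>" for c in contigs if c not in declared["contig"]]
    added += [
        f'##FILTER=<ID={f},Description="Added by preprocessing">'
        for f in filters
        if f not in declared["FILTER"]
    ]
    return "\n".join(fileformat + other + added + [column_line] + records) + "\n"
```

Running the function on its own output leaves the text unchanged, so it would be safe to apply to VCFs that already carry a complete header.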


jylee-bcm commented 1 week ago

We could accept TSV input and convert it to VCF with bcftools convert: https://samtools.github.io/bcftools/bcftools.html#convert

# Convert 23andme results into VCF
bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -o out.vcf.gz

# Convert tab-delimited file into a sites-only VCF (no genotypes), in this example first column to be ignored
bcftools convert -c -,CHROM,POS,REF,ALT -f ref.fa --tsv2vcf calls.txt -o out.bcf

Instead of allowing a compromised VCF format into the pipeline, what do you think about adding a conversion step to the pipeline?

If so, the Web UI will need to:

Have a TSV input option instead of VCF, and
Show the required TSV format clearly.
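To make "the required TSV format" concrete, here is a rough sketch of a check that could run before the conversion step. Everything in it is an assumption for discussion: the function name `validate_tsv` is hypothetical, the layout is taken to be four tab-separated columns (CHROM, POS, REF, ALT) with no header row, and alleles are limited to a single sequence of A, C, G, T, or N.

```python
import re

_POS = re.compile(r"^[1-9][0-9]*$")          # 1-based positive integer
_ALLELE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)


def validate_tsv(text: str) -> list:
    """Return a list of human-readable problems; an empty list means the TSV looks usable."""
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            errors.append(
                f"line {lineno}: expected 4 tab-separated columns, found {len(fields)}"
            )
            continue
        chrom, pos, ref, alt = fields
        if not chrom:
            errors.append(f"line {lineno}: CHROM is empty")
        if not _POS.match(pos):
            errors.append(f"line {lineno}: POS must be a positive integer, got {pos!r}")
        for name, allele in (("REF", ref), ("ALT", alt)):
            if not _ALLELE.match(allele):
                errors.append(f"line {lineno}: {name} must contain only A/C/G/T/N, got {allele!r}")
    return errors
```

The Web UI could show these messages directly to the user, which would also double as documentation of the expected format.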

hyunhwan-bcm commented 1 week ago

I reviewed the example and found that the error occurs because all of its variants are blacklisted. So the data itself is valid, and the failure actually happens in the feature engineering step. This seems related to #19. Alternatively, could we get another example whose variants are not all filtered out by the blacklist regions?
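If the root cause is an empty variant set after blacklist filtering, one option is to fail early with a clear message instead of letting feature engineering crash. The sketch below is purely illustrative: `drop_blacklisted` and `AllVariantsFilteredError` are hypothetical names, and the data shapes (variants as `(chrom, pos)` tuples, blacklist regions as `(chrom, start, end)` with 1-based inclusive coordinates) are assumptions, not the pipeline's real structures.

```python
class AllVariantsFilteredError(ValueError):
    """Raised when blacklist filtering leaves no variants to analyse."""


def drop_blacklisted(variants, blacklist):
    """Return the variants that fall outside every blacklist region."""
    kept = [
        (chrom, pos)
        for chrom, pos in variants
        if not any(
            chrom == b_chrom and start <= pos <= end
            for b_chrom, start, end in blacklist
        )
    ]
    # Distinguish "input was empty" from "everything was filtered out".
    if variants and not kept:
        raise AllVariantsFilteredError(
            f"all {len(variants)} input variants fall in blacklist regions; "
            "nothing is left for feature engineering"
        )
    return kept
```

With a check like this, the example from this issue would be reported as "all variants blacklisted" and not as a VCF format problem.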