gagneurlab / MMSplice_MTSplice

Tissue-specific variant effect predictions on splicing
MIT License
40 stars 21 forks source link

BUG: VCF requirements not clearly stated & wrong seqnames derived #70

Open KuechlerO opened 7 months ago

KuechlerO commented 7 months ago

Firstly thx for your amazing work. I really like the concept behind your tool :)

Now concerning the bug: I worked with a fresh Colab notebook

Description

I wanted to try out mmsplice, so I've downloaded your example data and gave it a go. But even though I used your test data, I constantly received the following error: ValueError: Fasta chrom names do not match with vcf chrom names.

After hours of wrapping my head around and hacking with the package code I found out:

  1. Not only a VCF-file is required as input, but also its index version (i.e. vcf.gz.tbi). That was never stated (check your ReadMe)... Please fix that.

  2. Also, when hacking around, I realized the parsing of the seqnames seems buggy.

    • It only parses seqnames when an indexed VCF file is provided
    • But it also always includes {'1'} as seqname for the VCF, no matter what is provided?!
# In your vcf_dataloader.py
def _check_chrom_annotation(self):
        # I've added these two lines
        fasta_chroms = set(self.fasta.fasta.keys())
        vcf_chroms = set(self.vcf.seqnames)

        print("fasta: ", fasta_chroms, flush=True)
        print("vcf_chroms", vcf_chroms, flush=True)
        if not fasta_chroms.intersection(vcf_chroms):
            raise ValueError(
                'Fasta chrom names do not match with vcf chrom names')

--> Output:
fasta:  {'17'}
vcf_chroms {'1', '17'}
...

The VCF seqnames should only include 17, since I've just provided your example.

What I Did

Take a look: https://colab.research.google.com/drive/1hx6PAYT_lKuEYtnHCq0PN2lyvNBNDud1?usp=sharing

KuechlerO commented 7 months ago

Ahh, the 1 might come from the annotated contig in the vcf file: ##contig=<ID=1,length=16>