BIMSBbioinfo / janggu

Deep learning infrastructure for genomics
GNU General Public License v3.0
254 stars 33 forks source link

VariantStreamer fails if ROI is supplied when loading the reference genome #12

Closed PedroBarbosa closed 4 years ago

PedroBarbosa commented 4 years ago

Hi,

I aim to generate fasta sequences of variants present in a vcf, so I thought to use VariantStreamer class to do so.

To get strand information on the variant, I automatically generate bed annotations by retrieving the geneID where the variant occurs (from VEP annotations). I fetch genomic coordinates of such genes, thus my final annotations refer to full gene length intervals that for sure span the target variants that occur within genes.

Since loading the entire genome takes quite a while, I want to use the same annotations generated before as the ROI to the Bioseq class. Basically, that's the code and the error I get:

#custom methods to validate vcf and generate annotations
self.vcf = self._validate_vcf(vcf_file)
self.annotations = self.get_variant_annotations()

genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, roi=self.annotations)
v = dna.VariantStreamer(bioseq=genome, variants=self.vcf, binsize=5, batch_size=1, annotation=self.annotations)
it_vcf = iter(v.flow())
print(next(it_vcf))

 File "test_vcf.py", line 16, in <module>
    main()
  File "test_vcf.py", line 12, in main
    obj = VariantSeq(args.vcf, args.fasta, args.out_dir)
  File "/Users/pbarbosa/git_repos/ml_genomics/vcf/variants.py", line 47, in __init__
    print(next(it_vcf))
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/dna.py", line 717, in flow
    iref = self.bioseq._getsingleitem(Interval(rec.chrom, start, end)).copy()
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/dna.py", line 519, in _getsingleitem
    return np.asarray(self.garray[interval][:, 0, 0])
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/genomicarray.py", line 274, in __getitem__
    idx = self.region2index[_iv_to_str(chrom, interval.start, interval.end)]
KeyError: 'chr1:21889497-21889502'

If I use instead genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, store_whole_genome=True), it works fine but it takes more than 20 minutes to load the genome in my laptop . Why is this happening, considering that the variant context fully overlaps the annotations?

I'd also like to pose some additional questions:

Thank you very much, Best, Pedro

wkopp commented 4 years ago

Hi Pedro,

yes, VEP currently requires to load the Bioseq object using store_whole_genome=True.

There are two solutions:

Regarding your questions:

I've also prepared another VEP tutorial notebook that makes use of these new features. See src/examples/variant_effect_prediction-part2.ipynb.

Best, Wolfgang

PedroBarbosa commented 4 years ago

Awesome @wkopp, thanks for the very quick updates. I'll dig into this.

Best, Pedro