Closed PedroBarbosa closed 4 years ago
Hi Pedro,
Yes, VEP currently requires loading the Bioseq object with `store_whole_genome=True`.
There are two solutions:
1. Use `genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, store_whole_genome=True, cache=True)`. This caches the Bioseq object and reloads it from the cache in the future, which is much faster.
2. `VariantStreamer(dna, ...)` has been changed such that the first argument can be a Bioseq object or a reference genome in fasta format. This is currently on github, but will be part of the next pypi version. When a reference genome is supplied as the argument, the streamer doesn't load the whole genome, but only the relevant sequence stretches overlapping the SNVs.
Regarding your questions:
- You can save the model with `model.save()` and afterwards use `model = Janggu.create_by_name('modelname')` to reload it.
- A Janggu model can be created from an existing keras model with `model = Janggu(kerasmodel.inputs, kerasmodel.outputs)`.
- There is now also a `predict_variant_effect` function. It performs the same functionality as `Janggu.predict_variant_effect`, but takes the keras model as its first argument. This will also be part of the next version, but you could try it from github if you want.
I've also prepared another VEP tutorial notebook that makes use of these new features. See src/examples/variant_effect_prediction-part2.ipynb.
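To illustrate the idea of fetching only the sequence stretch around each SNV instead of the whole genome, here is a minimal pure-Python sketch. It is not Janggu's implementation: `variant_window`, its parameters and the toy sequence are hypothetical and only show the windowing and allele substitution on a plain string.

```python
def variant_window(sequence, pos, ref, alt, flank):
    """Return (ref_window, alt_window) centred on a single-nucleotide variant.

    sequence : str, the sequence of the chromosome or region (hypothetical input)
    pos      : int, 1-based position, as in a VCF record
    ref, alt : single-letter reference and alternative alleles
    flank    : number of bases kept on each side of the variant
    """
    idx = pos - 1  # convert the 1-based VCF position to a 0-based string index
    if sequence[idx].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    start = max(0, idx - flank)
    end = min(len(sequence), idx + flank + 1)
    ref_window = sequence[start:end]
    # Same window, with the reference base replaced by the alternative allele.
    alt_window = sequence[start:idx] + alt + sequence[idx + 1:end]
    return ref_window, alt_window


# Toy example: an A>G variant at position 5 with 2 flanking bases on each side.
chrom_seq = "ACGTACGTAC"
ref_win, alt_win = variant_window(chrom_seq, pos=5, ref="A", alt="G", flank=2)
print(ref_win, alt_win)
```

Checking the reference allele first catches coordinate mismatches (for example, a VCF called against a different genome build) before any sequence is written out.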
Best, Wolfgang
Awesome @wkopp, thanks for the very quick updates. I'll dig into this.
Best, Pedro
Hi,
I aim to generate fasta sequences of variants present in a vcf, so I thought I would use the VariantStreamer class to do so.
To get strand information on the variant, I automatically generate bed annotations by retrieving the geneID where the variant occurs (from VEP annotations). I fetch genomic coordinates of such genes, thus my final annotations refer to full gene length intervals that for sure span the target variants that occur within genes.
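As a rough sketch of the strand lookup described above: given gene intervals with a strand, find the gene that contains a variant and orient the sequence accordingly. This is plain Python rather than Janggu code, and `strand_for_variant`, `reverse_complement` and the toy intervals are hypothetical. It assumes BED-style 0-based, half-open intervals and 1-based VCF positions.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_for_variant(chrom, pos, genes):
    """Return the strand of the first gene interval containing the variant.

    genes : iterable of (chrom, start, end, strand), BED-style 0-based, half-open
    pos   : 1-based variant position, as in a VCF record
    Returns None when no interval contains the variant.
    """
    idx = pos - 1  # convert to the 0-based coordinate system used by BED
    for g_chrom, start, end, strand in genes:
        if g_chrom == chrom and start <= idx < end:
            return strand
    return None


# Toy annotations: one gene on each strand of chr1.
genes = [("chr1", 100, 200, "+"), ("chr1", 500, 900, "-")]
strand = strand_for_variant("chr1", 600, genes)

# Orient a variant-centred window according to the strand of the gene.
window = "GTGCG"
oriented = reverse_complement(window) if strand == "-" else window
print(strand, oriented)
```

When genes overlap on opposite strands, this returns the first match only, so the annotation order would decide the strand in that case.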
Since loading the entire genome takes quite a while, I want to use the same annotations generated before as the ROI for the `Bioseq` class. Basically, that's the code and the error I get:
If I instead use `genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, store_whole_genome=True)`, it works fine, but it takes more than 20 minutes to load the genome on my laptop. Why is this happening, considering that the variant context fully overlaps the annotations?
I'd also like to pose some additional questions:
- Is it possible to use the `predict_variant_effect` method on models previously trained with Keras and Tensorflow? I mean, it would be nice to create a Janggu instance from previously serialised h5 models.
Thank you very much, Best, Pedro