Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0

Additional species #41

Closed: rahulvrane closed this issue 8 months ago

rahulvrane commented 3 years ago

Hello, we are keen to use Nirvana for non-model species, but support for them doesn't seem to be available by default. Is there a way to build additional models for other species? We have ample data to run any training as needed.

Thanks a ton. Cheers, R

Andy-B-123 commented 3 years ago

Hi, I would second this request! The program and documentation look excellent and would be very useful for a well-documented organism I am working with. Some functionality to build a database for an organism from a FASTA file and a GFF file would be brilliant :-)

MichaelStromberg commented 3 years ago

Wow! Normally our team just focuses on annotating human genomes (and recently the SARS-CoV-2 genome that causes COVID-19), but this is definitely something we can enable.

By allowing users to bring in a FASTA file (for the reference) and a GFF file (for the gene models), we could enable anyone to start annotating their favorite species. For external data sets, you can already use our custom annotation functionality to enable those.

The one gap that I see is handling transcript sequences that differ from the reference genome. For example, if you were to build your cDNA or coding sequence (CDS) purely from the intervals defined in the GFF, it would still lead to a fair amount of error. One of the hallmarks of using RefSeq as a transcript source is that they define a unique transcript sequence that can differ from the reference genome. To be perfect, we would need to introduce a way of specifying how each exon should be aligned to the genome (substitutions, insertions, and deletions).
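To make the limitation concrete, here is a minimal Python sketch of the naive approach: splicing a CDS straight out of the reference using GFF intervals. It is not Nirvana code. The function names (`parse_gff_cds`, `build_cds`), the in-memory `reference` dictionary, and the assumption that each CDS row carries a single `Parent` attribute are all hypothetical simplifications. Because every base comes from the reference, any substitution, insertion, or deletion present in the curated transcript is lost.

```python
# Hypothetical sketch: rebuild a CDS from GFF3 intervals and a reference sequence.
# Every base is copied from the reference, so transcript-specific differences
# (substitutions, insertions, deletions) cannot be represented.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_gff_cds(gff_text):
    """Collect CDS intervals per transcript from GFF3 text.

    Returns {parent_id: (seqid, strand, [(start, end), ...])} with
    1-based inclusive coordinates, as used by GFF3.
    Assumes a single Parent value per row (a simplification).
    """
    transcripts = {}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        entry = transcripts.setdefault(parent, (seqid, strand, []))
        entry[2].append((start, end))
    return transcripts


def build_cds(reference, seqid, strand, intervals):
    """Concatenate the CDS intervals in genomic order; flip for the minus strand."""
    chrom = reference[seqid]
    spliced = "".join(chrom[start - 1:end] for start, end in sorted(intervals))
    return reverse_complement(spliced) if strand == "-" else spliced
```

Usage would look like `build_cds(reference, *parse_gff_cds(gff_text)["tx1"])`, where `reference` maps sequence names to strings loaded from the FASTA file and `tx1` is a placeholder transcript ID.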

The NCBI publishes a BAM file (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_knownrefseq_alignments.bam) for the human genome that captures these types of alignments. These seem to exist for other model organisms like B. taurus (ftp://ftp.ncbi.nlm.nih.gov/refseq/B_taurus/annotation_releases/current/GCF_002263795.1_ARS-UCD1.2/GCF_002263795.1_knownrefseq_alns.bam) and D. rerio (ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/annotation_releases/current/GCF_000002035.6_GRCz11/RefSeq_transcripts_alignments/GCF_000002035.6_GRCz11_knownrefseq_alns.bam).
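As an illustration of what such alignments encode, the Python sketch below summarises a SAM/BAM CIGAR string and flags transcripts whose alignment contains insertions, deletions, or explicit mismatches. The function names (`summarize_cigar`, `differs_from_reference`) and the example CIGAR strings are hypothetical and not taken from the NCBI files. One caveat: the `M` operation covers both matches and mismatches, so substitutions are only visible here when the aligner emits `=`/`X`; otherwise they would have to be recovered from the read sequence or the `MD` tag.

```python
import re
from collections import Counter

# CIGAR operations defined by the SAM specification.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")


def summarize_cigar(cigar):
    """Return the total number of bases per CIGAR operation."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    totals = Counter()
    for length, op in CIGAR_TOKEN.findall(cigar):
        totals[op] += int(length)
    return totals


def differs_from_reference(cigar):
    """True if the alignment has insertions (I), deletions (D), or explicit mismatches (X).

    Introns (N) are expected in a spliced transcript alignment and are ignored.
    """
    totals = summarize_cigar(cigar)
    return any(totals[op] > 0 for op in "IDX")
```

For example, `differs_from_reference("120M2I35M1D40M")` reports a difference, while a plain `"200M"` does not (although it may still hide substitutions inside the `M` run).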

I'll talk to the team about supporting GFF + FASTA to begin with. From there, we can start thinking about how to add the BAM alignments to handle the last bit.

rahulvrane commented 3 years ago

Thanks for this, Michael. We would be more than happy to provide RefSeq-like BAMs and GFFs where needed for testing and debugging! Very keen to see how you go!

Cheers, Rahul

Redmar-van-den-Berg commented 2 years ago

Is there any news on this front? I like to test my pipelines with some data for chrM, so it would be great to be able to use Nirvana with a small database that only contains chrM, rather than the 20GB default one.

geoffjentry commented 8 months ago

I'm assuming the lack of updates here means no one has pushed this forward, but I'm also interested to know if there's been any further thought on this topic.

rahulvrane commented 8 months ago

I believe this was already incorporated as a feature.

geoffjentry commented 8 months ago

@rahulvrane I came across this issue while searching, because I couldn't find any evidence of support for other species in the documentation. Can you point me to the appropriate docs and/or an example?