bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Missing miRBase annotation when installing BDGP6 and dm3 genomes with smallrna datatarget #3112

Closed DrHogart closed 4 years ago

DrHogart commented 4 years ago

Hi, running bcbio_nextgen.py upgrade --genomes BDGP6 --datatarget smallrna produces the following output:

List of genomes to get (from the config file at '{'genomes': [{'dbkey': 'BDGP6', 'name': 'D melangogaster (BDGP6)', 'indexes': ['seq'], 'annotations': ['transcripts']}], 'genome_indexes': ['bwa', 'bowtie2', 'rtg', 'star'], 'install_liftover': False, 'install_uniref': False}'): D melangogaster (BDGP6).

As you can see, there are no miRBase annotation files, only 'transcripts'. The same happens with the dm3 genome. Correspondingly, there was no srnaseq folder after upgrading, so smallRNA-seq analysis doesn't work. By contrast, an upgrade with the smallrna datatarget for hg19 gets the miRBase annotations correctly.
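The same check can be done mechanically on the config that the upgrade prints. The following is a minimal sketch, not part of bcbio: the dict is copied from the log above, while the function name `missing_annotations` is made up for illustration and the annotation name `mirbase` is an assumption based on the `mirbase.yaml` recipe files mentioned later in this thread.

```python
import ast

# Config dict copied verbatim from the upgrade log quoted above.
LOGGED_CONFIG = (
    "{'genomes': [{'dbkey': 'BDGP6', 'name': 'D melangogaster (BDGP6)', "
    "'indexes': ['seq'], 'annotations': ['transcripts']}], "
    "'genome_indexes': ['bwa', 'bowtie2', 'rtg', 'star'], "
    "'install_liftover': False, 'install_uniref': False}"
)


def missing_annotations(config_text, wanted):
    """Return {dbkey: [wanted annotations that the genome does not list]}.

    Hypothetical helper for illustration; 'wanted' holds annotation names
    such as 'mirbase' (an assumed name, see the note above).
    """
    config = ast.literal_eval(config_text)  # safely parse the printed dict
    report = {}
    for genome in config.get("genomes", []):
        present = set(genome.get("annotations", []))
        absent = [name for name in wanted if name not in present]
        if absent:
            report[genome["dbkey"]] = absent
    return report


print(missing_annotations(LOGGED_CONFIG, ["mirbase"]))
```

For the log above this reports BDGP6 as lacking the assumed `mirbase` annotation, which matches the missing srnaseq folder.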

bcbio 1.2.0

Could you please add the srnaseq data to the BDGP6 genome resources?

naumenko-sa commented 4 years ago

Hi Sergei @DrHogart!

Thanks for reporting the issue!

While miRNA analysis is mostly supported for the human genome, I was able to run an A.thaliana miRNA analysis once: https://github.com/bcbio/bcbio-nextgen/issues/1416

To enable this analysis for Drosophila, we need to create a mirbase recipe in cloudbiolinux: https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/BDGP6 similar to the one for hg19 (human): https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg19/mirbase.yaml or for A.thaliana: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/TAIR10/mirbase.yaml

And then update resources.yaml in bcbio: https://github.com/bcbio/bcbio-nextgen/blob/master/config/genomes/BDGP6-resources.yaml similarly to A.thaliana https://github.com/bcbio/bcbio-nextgen/blob/master/config/genomes/TAIR10-resources.yaml

If you could help create and test this recipe (making sure all files are downloaded and that they correspond to the correct reference) with a pull request (PR), that would speed up the process.

Here is how to test a recipe: https://github.com/chapmanb/cloudbiolinux/blob/master/doc/hacking.md#testing-a-ggd-recipe

Sergey

naumenko-sa commented 4 years ago

Thanks for the PRs! bcbio now installs srnaseq for BDGP6 for me.

DrHogart commented 4 years ago

For me bcbio also installs srnaseq, but seqbuster and mirdeep2 still don't work... I've just realized that the BDGP6 genome file is from Ensembl and its chromosome names are '2L', '2R' and so on, while srna-transcripts.gff, mirbase.gff3 and other files from srnaseq use 'chr2L', 'chr2R' and so on. Could this discrepancy be the reason? I can only check this tomorrow.
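A naming mismatch like this can be confirmed by comparing the first column of the reference's FASTA index (.fai) with the first column of the annotation file. The sketch below is illustrative only: the helper names are made up, and the sample lines are toy data, not the real BDGP6 or miRBase contents.

```python
def fasta_index_names(fai_lines):
    """Sequence names are the first tab-separated column of a .fai index."""
    return {line.split("\t", 1)[0] for line in fai_lines if line.strip()}


def gff_seqids(gff_lines):
    """Sequence IDs are the first column of every non-comment GFF line."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def compare_names(fai_lines, gff_lines):
    """Report names that appear on only one side of the comparison."""
    reference = fasta_index_names(fai_lines)
    annotation = gff_seqids(gff_lines)
    return {
        "only_in_annotation": sorted(annotation - reference),
        "only_in_reference": sorted(reference - annotation),
    }


# Toy data with made-up coordinates, mimicking the two naming styles.
toy_fai = ["2L\t1000\t4\t60\t61\n", "2R\t2000\t1025\t60\t61\n"]
toy_gff = [
    "##gff-version 3\n",
    "chr2L\t.\tmiRNA_primary_transcript\t100\t200\t.\t+\t.\tID=example1\n",
    "chr2R\t.\tmiRNA_primary_transcript\t300\t400\t.\t-\t.\tID=example2\n",
]

print(compare_names(toy_fai, toy_gff))
```

With real files, the lines would come from opening the `.fai` and `.gff3` files instead of the toy lists; if both result lists are empty, the names agree.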

naumenko-sa commented 4 years ago

yes, I think it is better to have chr names matching the reference. See some chromosome mapping helper scripts: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg19/topmed.yaml S
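If the names were to be harmonized, one simple approach is to drop the 'chr' prefix from the first column of the annotation files so that they match an Ensembl-style reference. This is a minimal sketch under that assumption and is not what the linked helper scripts do; `strip_chr_prefix` is a hypothetical name, and sequences whose names differ by more than the prefix (mitochondrial contigs are a common case) would need an explicit mapping table instead.

```python
def strip_chr_prefix(gff_lines, prefix="chr"):
    """Remove a leading prefix from the seqid column of GFF lines.

    Comment/directive lines (starting with '#') and blank lines are
    passed through unchanged.
    """
    renamed = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            renamed.append(line)
            continue
        seqid, sep, rest = line.partition("\t")
        if seqid.startswith(prefix):
            seqid = seqid[len(prefix):]
        renamed.append(seqid + sep + rest)
    return renamed


# Toy example with made-up coordinates.
example = [
    "##gff-version 3\n",
    "chr2L\t.\tmiRNA\t100\t120\t.\t+\t.\tID=example1\n",
    "2R\t.\tmiRNA\t300\t320\t.\t-\t.\tID=example2\n",
]
for line in strip_chr_prefix(example):
    print(line, end="")
```

Lines that already lack the prefix are left as they are, so running the function twice gives the same result as running it once.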

DrHogart commented 4 years ago

> but seqbuster, mirdeep2 still doesn't work.

It was just an indentation bug in config/genomes/BDGP6-resources.yaml; please see the PR. Now smallRNA-seq analysis runs fine for me.

> yes, I think it is better to have chr names matching the reference.

Since smallRNA-seq generates its results correctly (as far as I understand them), I left the chromosome names unchanged, so they don't match the chromosome names of the reference. Please let me know if I should change them (e.g. for consistency with the repo's general conventions).

naumenko-sa commented 4 years ago

Thanks @DrHogart !

I am fine with leaving the chr names as they are for now, since it produces the right results. In variant analysis chr/nochr was always an issue, but atac-seq and mirna tools may tolerate that difference. I saw Lorena was just linking recipes for grch38/hg19, so it worked before for H.sapiens. We are documenting it here; if anybody sees any issues with BDGP6/mirna, please re-open this issue.