kneubehl closed this issue 5 months ago
Hi,
There are quite a lot of issues to unpack with prokaryotes, and they are not something bambu is optimized for. The main difficulty is that the messy starts and ends of long reads make it hard to separate single-exon transcripts from each other, and this is compounded in prokaryotes, which have very compact genomes with overlapping genes.
Despite this, if you want to try using operon ids instead, I would recommend manually placing the operon id into the gene id field of the input GTF. While I haven't tested this, I believe it should work fine for quantification; transcript discovery, however, will be problematic. You would need to turn on single-exon discovery, but I am not sure it would provide any meaningful results.
Sorry that I cannot be of more help here. This is something we want to improve in the future, but it will require a significant undertaking.
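As a rough, untested illustration of the gene id substitution described above, a small Python script along these lines could rewrite the GTF before it is passed to bambu. Everything in it is an assumption made for the example: the function name `relabel_gene_ids`, the gene-to-operon mapping, and the toy GTF records are hypothetical, and the operon assignments would have to come from your own annotation or an operon prediction tool.

```python
import re

# Matches the gene_id attribute in the 9th GTF column, e.g. gene_id "lacZ";
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')


def relabel_gene_ids(gtf_lines, gene_to_operon):
    """Yield GTF lines with gene_id replaced by an operon id where one is known.

    gene_to_operon is a hypothetical dict such as {"lacZ": "operon_lac"}.
    Genes missing from the mapping keep their original gene_id.
    """
    def swap(match):
        gene = match.group(1)
        return 'gene_id "%s"' % gene_to_operon.get(gene, gene)

    for line in gtf_lines:
        # Pass comments and blank lines through untouched.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        ending = "\n" if line.endswith("\n") else ""
        fields = line.rstrip("\n").split("\t")
        # A valid GTF record has exactly 9 tab-separated columns.
        if len(fields) != 9:
            yield line
            continue
        fields[8] = GENE_ID_RE.sub(swap, fields[8], count=1)
        yield "\t".join(fields) + ending


# Toy records (made up for illustration only).
example_gtf = [
    "# toy annotation\n",
    'chr1\tsrc\texon\t100\t900\t.\t+\t.\tgene_id "lacZ"; transcript_id "lacZ_t1";\n',
    'chr1\tsrc\texon\t950\t1500\t.\t+\t.\tgene_id "lacY"; transcript_id "lacY_t1";\n',
    'chr1\tsrc\texon\t3000\t3500\t.\t-\t.\tgene_id "recA"; transcript_id "recA_t1";\n',
]
example_mapping = {"lacZ": "operon_lac", "lacY": "operon_lac"}

relabeled = list(relabel_gene_ids(example_gtf, example_mapping))
for record in relabeled:
    print(record, end="")
```

Only the gene id is changed here; transcript ids are left alone so that each annotated transcript stays distinct within its operon. Whether the relabeled annotation gives sensible quantification would still need to be checked on real data.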
Kind Regards, Andre Sim
I figured it was a long shot but I thought I'd reach out and see. Thank you for your insights!
Regards,
Alex Kneubehl, PhD
Postdoctoral Associate Translational Virology Lab Vector Biology and Bacterial Pathogens Lab Division of Tropical Medicine
Department of Pediatrics
Baylor College of Medicine
Twitter: @AlexKneubehl
From: Andre Sim
Sent: Monday, June 17, 2024 8:24 PM
To: GoekeLab/bambu
Cc: KNEUBEHL, Alexander Robert; Author
Subject: Re: [GoekeLab/bambu] Use with prokaryotes to ID operons(?) (Issue #431)
Hi there,
This is a question, not an issue. I was wondering whether bambu could be used to ID operons in bacterial long-read cDNA sequencing data. Conceptually, to me at least, operon usage and exon usage could be similar at a gross scale. Retention of certain genes of an operon within the same genomic region could be treated like exon retention, though the "introns" (i.e. intergenic space) would likely be a lot shorter and there would be no defined splice junctions.
I am wondering if using the single-exon settings might help to flesh out the operons, which could then be used to better quantify gene expression. Compared with short-read RNA-seq quantification tools, I am worried about multi-mapping issues with long cDNA reads, since a single read would effectively contain multiple genes and so map to multiple genome annotations. I feel like isoform discovery could take all this into account. Or maybe I am way off. Your thoughts on this would be appreciated.