Arcadia-Science / 2022-tx-not-in-gx

MIT License
0 stars 0 forks source link

Build a 100k cover db of all genomes and subtract using that instead of organism-specific databases? #5

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

As mentioned in #4, this pipeline is supposed to target any distinct transcriptome sequence, meaning anything that has never been observed in any DNA before. (I don't think we need to worry about RNA contaminating DNA...since rtPCR isn't a step in DNA sequencing, I think any RNA contam would just...disapper? I should check this). If we are targeting truly unique RNA mods (splicing, editing, nucleic acid modifications, ...), I think it would be ok to subtract all DNA sequences ever.

This would also probably remove contamination, which would be nice (see #4).

This would also have the benefit of allowing us to include all of the k-mers from a given species as a cover, which might reduce problems with heterozygosity (see #3). It will also dramatically simplify the encoding of the workflow -- we wouldn't need to know what organism we were working with before doing the subtraction, and we wouldn't need to specify which genome to subtract.

See https://github.com/ctb/2022-database-covers and https://github.com/sourmash-bio/sourmash/issues/1852 for cover DB construction. I would want to work with the latest genbank, as well as Eukaryotic genbank. For Euk genbank, I'd need to download and sketch all of the genomes (except protists and fungi which are already in the genbank db). (Note that here including repeat elements and contamination wouldn't cause problems 🎉 )

Also note that over in https://github.com/greenelab/2022-microberna/, I already sketched all of the microbial transcriptomes on the SRA and all of the signatures are CCBy so...

I don't think this would scale to metagenomes though yet...although I don't know how scaled100k would improve the size of that.

taylorreiter commented 1 year ago

if i do this, I should benchmark running gather with a threshold bp of 0 and outputting unassigned vs. subtracting using the zip of the db. I don't have a good guess as to which would be faster (I'm also not sure if output unassigned keeps the abundances of query signature)

Also it would be fun to have these profiles for everything (although they could be wrong in weird ways with repetitive elements and such).

taylorreiter commented 1 year ago

I just read the quote, "In mammals, editing mostly occurs within genomic repeats (Bazak et al., 2014a, Levanon et al., 2004, Neeman et al., 2006). In primates specifically, most RNA-editing sites are found in Alu repeats, whose sequence facilitates the creation of a double-stranded RNA structure that promotes ADAR binding. Similarly, editing in Octopus bimaculoides is enriched in repeat regions (303,414/903,742 sites, 34%; 159,005 of them in annotated repeats)." (source: https://www.sciencedirect.com/science/article/pii/S0092867417303446). So a cover of eukaryotic genomes (or any genome with lots of repeats) might not have the desired effect. I'd be concerned that I would accidentally subtract an edited site because the edited sequence occurred in a different genome.