what's the problem with duplicated CDS ?

martin-raden commented 2 years ago

Hi @PatrickRWright @JensGeorg

CopraRNA checks for duplicated CDS within the all.fas file:

https://github.com/PatrickRWright/CopraRNA/blob/fdbca79d413b550bf5da62e0cd58fcc8f88c3da6/coprarna_aux/homology_intaRNA.pl#L341

why? what is the problem with them? I find that even the recent E.coli genome has duplicated CDS, like

bash-4.2$ bzgrep -A10 -B10 b0149 ~/data/ncbi-refseq-gbk/NC_000913.gb.bz2
                     LQMLGALEGERLSAQGQKMAALGNDPRLAAMLVSAKNDDEAATAAKIAAILEEPPRMG
                     NSDLGVAFSRNQPAWQQRSQQLLKRLNVRGGEADSSLIAPLLAGAFADRIARRRGQDG
                     RYQLANGMGAMLDANDALSRHEWLIAPLLLQGSASPDARILLALLVDIDELVQRCPQL
                     VQQSDTVEWDDAQGTLKAWRRLQIGQLTVKVQPLAKPSEDELHQAMLNGIRDKGLSVL
                     NWTAEAEQLRLRLLCAAKWLPEYDWPAVDDESLLAALETWLLPHMTGVHSLRGLKSLD
                     IYQALRGLLDWGMQQRLDSELPAHYTVPTGSRIAIRYHEDNPPALAVRMQEMFGEATN
                     PTIAQGRVPLVLELLSPAQRPLQITRDLSDFWKGAYREVQKEMKGRYPKHVWPDDPAN
                     TAPTRRTKKYS"
     gene            164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
     CDS             164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /EC_number="2.4.1.129"
                     /codon_start=1
                     /transl_table=11
                     /product="peptidoglycan glycosyltransferase/peptidoglycan
                     DD-transpeptidase MrcB"
                     /protein_id="NP_414691.1"
                     /db_xref="UniProtKB/Swiss-Prot:P02919"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
--
                     MLSARPLGVQPRGGVISPQPAFMQLVRQELQAKLGDKVKDLSGVKIFTTFDSVAQDAA
                     EKAAVEGIPALKKQRKLSDLETAIVVVDRFSGEVRAMVGGSEPQFAGYNRAMQARRSI
                     GSLAKPATYLTALSQPKIYRLNTWIADAPIALRQPNGQVWSPQNDDRRYSESGRVMLV
                     DALTRSMNVPTVNLGMALGLPAVTETWIKLGVPKDQLHPVPAMLLGALNLTPIEVAQA
                     FQTIASGGNRAPLSALRSVIAEDGKVLYQSFPQAERAVPAQAAYLTLWTMQQVVQRGT
                     GRQLGAKYPNLHLAGKTGTTNNNVDTWFAGIDGSTVTITWVGRDNNQPTKLYGASGAM
                     SIYQRYLANQTPTPLNLVPPEDIADMGVDYDGNFVCSGGMRILPVWTSDPQSLCQQSE
                     MQQQPSGNPFDQSSQPQQQPQQQPAQQEQKDSDGVAGWIKDMFGSN"
     CDS             164865..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /codon_start=1
                     /transl_table=11
                     /product="PBP1Bgamma"
                     /protein_id="YP_010051172.1"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
                     /translation="MPRKGKGKGKGRKPRGKRGWLWLLLKLAIVFAVLIAIYGVYLDQ
                     KIRSRIDGKVWQLPAAVYGRMVNLEPDMTISKNEMVKLLEATQYRQVSKMTRPGEFTV

can this be ignored? if not: can we just prune the all.fas to the first occurrence of each CDS?

thanks, Martin

martin-raden commented 2 years ago

the latter, ie removing all but the first duplicate, can be done using

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";
# remove duplicated CDS to the first (hopefully longest)
if (-s "duplicated_CDS.txt") {
  # grep all duplicated CDS entries from FASTA file
  system "for P in `cat duplicated_CDS.txt`; do grep -A1 -m1 \"\$P\" all.fas; done > duplicated_CDS.first.fa";
  # compile a pattern to match all duplicated gene ids
  # remove all duplicated entries from all.fas
  system "PAT=\$(cat duplicated_CDS.txt | tr '\\n' '|'); cat all.fas | tr \"\\n\" \"#\" | sed \"s/#>/\\n>/g\" | grep -v -P  \"\${PAT%|}\" | tr '#' '\\n'  > all.fas.no-duplicates";
  # join both files into a new all.fas
  system "cat all.fas.no-duplicates duplicated_CDS.first.fa > all.fas";
}

which should be done BEFORE doing the domclust call, ie. after all.fas was created!

PatrickRWright commented 2 years ago

Hi Martin, its been years now but off the top of my head I think this may have been an issue with domclust. i.e. if duplicate CDS exist, then domclust fails. This is just a best guess.

martin-raden commented 2 years ago

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

JensGeorg commented 2 years ago

is that a sequence based check or an identifier based check?

On Thu, Jan 13, 2022 at 10:12 AM Martin Raden @.***> wrote:

Hi @PatrickRWright https://github.com/PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

— Reply to this email directly, view it on GitHub https://github.com/PatrickRWright/CopraRNA/issues/35#issuecomment-1011939866, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH44J4MZG4VTFJXCZPTMSADUV2JQRANCNFSM5LZNFFMQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

martin-raden commented 2 years ago

ID-based... it greps the CDS-FASTA-IDs and checks for duplicates within the all.fas file (which is used in domclust etc)

martin-raden commented 2 years ago

eventually, it seems to me (at least for e.coli) that the genome file lists sometimes after the full CDS also subsequences as CDS (based on alternative start codons?)

PatrickRWright commented 2 years ago

Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue right? Otherwise, we are a bit doomed, since even E.coli nowadays shows multiple CDS for some genes...

I would guess so but I'm not sure. How was this triggered? Were there errored runs?

martin-raden commented 2 years ago

we are currently migrating the webserver to a new cloud-based computation platform and thus have to reinstall everything. with that, I also updated the genome files, which now provides e.coli runs with an error due to the CDS duplicates. thus, I am currently working around it and would suggest to incorporate the change into the dev branch for integration into CopraRNA3...

PatrickRWright commented 2 years ago

Ok, so I think it depends on the extent of sequences that will need to be removed for CopraRNA to work from the technical side. How many duplicates are we talking about for E. coli? I think relevance of simply removing many duplicates can best be evaluated by @JensGeorg I would suggest that you have a look at what is duplicated and why and then assess what the consequences of "straight forward" removal would be. This requires biological domain knowledge. If the duplicates are few (<10) I assume the relevance in regard to results is minimal.

martin-raden commented 2 years ago

they are few (6 genes) and the suggestion from above doesnt remove all CDS from these genes but keeps the first CDS version (which is from first inspection the longest CDS covering the whole gene and not only parts of it)

so I would assume the impact small to non-existing but it keeps the workflow running and more robust.

@JensGeorg what do you think?

PatrickRWright commented 2 years ago

I think keeping one is better than removing them completely.

PatrickRWright / CopraRNA

what's the problem with duplicated CDS ? #35