Open martin-raden opened 2 years ago
The latter, i.e. removing all but the first duplicate, can be done using
# list CDS header lines that occur more than once in all.fas
# (sort first, since uniq -d only reports adjacent duplicates)
system "grep '>' all.fas | sort | uniq -d > duplicated_CDS.txt";
# reduce each duplicated CDS to its first (hopefully longest) entry
if (-s "duplicated_CDS.txt") {
    # grep the first entry of each duplicated CDS from the FASTA file
    # (assumes single-line sequences and headers without whitespace)
    system "for P in `cat duplicated_CDS.txt`; do grep -A1 -m1 \"\$P\" all.fas; done > duplicated_CDS.first.fa";
    # compile a pattern to match all duplicated gene ids and
    # remove all duplicated entries from all.fas (assumes no '#' in all.fas)
    system "PAT=\$(cat duplicated_CDS.txt | tr '\\n' '|'); cat all.fas | tr \"\\n\" \"#\" | sed \"s/#>/\\n>/g\" | grep -v -P \"\${PAT%|}\" | tr '#' '\\n' > all.fas.no-duplicates";
    # join both files into a new all.fas
    system "cat all.fas.no-duplicates duplicated_CDS.first.fa > all.fas";
}
which should be done BEFORE the domclust call, i.e. right after all.fas has been created!
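For comparison, the same "keep only the first entry per header" idea can be sketched in a single pass. This is a Python illustration rather than the Perl/shell used in homology_intaRNA.pl, and the function name `first_occurrence_only` is made up for this example. It assumes duplicates share an identical header line; unlike the pipeline above it also copes with multi-line sequences and keeps the original record order.

```python
from typing import Iterable, Iterator


def first_occurrence_only(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, dropping every record whose header was seen before.

    Hypothetical helper, not part of CopraRNA. Lines that appear before the
    first header are dropped as well.
    """
    seen = set()
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line.rstrip("\n")
            keep = header not in seen  # keep only the first record per header
            seen.add(header)
        if keep:
            yield line


# toy input: geneA occurs twice, the second copy is a shorter subsequence
example = [
    ">geneA\n", "ATGAAATTTGGG\n",
    ">geneA\n", "ATGGGG\n",
    ">geneB\n", "ATGCCC\n", "TTTAAA\n",
]
deduplicated = list(first_occurrence_only(example))
```

On a real file this could be used as `open("all.fas")` piped through the generator and written to a new file, which would then replace all.fas before domclust is called.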
Hi Martin, it's been years now, but off the top of my head I think this may have been an issue with domclust, i.e. if duplicate CDS exist, then domclust fails. This is just a best guess.
Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue, right? Otherwise, we are a bit doomed, since even E. coli nowadays shows multiple CDS for some genes...
Is that a sequence-based check or an identifier-based check?
On Thu, Jan 13, 2022 at 10:12 AM Martin Raden wrote:
> Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue, right? Otherwise, we are a bit doomed, since even E. coli nowadays shows multiple CDS for some genes...
ID-based... it greps the CDS FASTA IDs and checks for duplicates within the all.fas file (which is used in domclust etc.).
It seems to me (at least for E. coli) that the genome file sometimes lists, after the full CDS, also subsequences of it as CDS (possibly based on alternative start codons?).
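To make the ID-based check concrete, here is a minimal Python sketch (not the CopraRNA code; `duplicated_headers` is a hypothetical name) that counts header lines and reports those seen more than once. It compares whole header lines only, so two records with different IDs but identical sequence would not be flagged.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_headers(lines: Iterable[str]) -> List[str]:
    """Return FASTA header lines that occur more than once (hypothetical helper)."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    # Counter keeps first-seen order, so the result follows the file order
    return [header for header, n in counts.items() if n > 1]


# toy input: geneA is duplicated, and the copies are not adjacent
fasta_lines = [
    ">geneA\n", "ATGAAATTTGGG\n",
    ">geneB\n", "ATGCCC\n",
    ">geneA\n", "ATGGGG\n",
]
dups = duplicated_headers(fasta_lines)
```

Because the headers are counted rather than compared to their neighbours, the copies do not need to be next to each other in the file.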
> Hi @PatrickRWright (happy new year, btw), thanks for the quick reply. So pruning duplicated entries to one should fix the issue, right? Otherwise, we are a bit doomed, since even E. coli nowadays shows multiple CDS for some genes...
I would guess so but I'm not sure. How was this triggered? Were there errored runs?
We are currently migrating the webserver to a new cloud-based computation platform and thus have to reinstall everything. With that, I also updated the genome files, and E. coli runs now fail with an error due to the CDS duplicates. Thus, I am currently working around it and would suggest incorporating the change into the dev branch for integration into CopraRNA3...
Ok, so I think it depends on how many sequences will need to be removed for CopraRNA to work from the technical side. How many duplicates are we talking about for E. coli? I think the relevance of simply removing many duplicates can best be evaluated by @JensGeorg. I would suggest that you have a look at what is duplicated and why, and then assess what the consequences of "straightforward" removal would be. This requires biological domain knowledge. If the duplicates are few (<10), I assume the relevance in regard to results is minimal.
They are few (6 genes), and the suggestion from above doesn't remove all CDS of these genes but keeps the first CDS version (which, from a first inspection, is the longest CDS covering the whole gene and not only parts of it).
So I would assume the impact to be small to non-existent, but it keeps the workflow running and makes it more robust.
@JensGeorg what do you think?
I think keeping one is better than removing them completely.
Hi @PatrickRWright @JensGeorg
CopraRNA checks for duplicated CDS within the all.fas file:
https://github.com/PatrickRWright/CopraRNA/blob/fdbca79d413b550bf5da62e0cd58fcc8f88c3da6/coprarna_aux/homology_intaRNA.pl#L341
Why? What is the problem with them? I find that even the recent E. coli genome has duplicated CDS, like
Can this be ignored? If not: can we just prune the all.fas to the first occurrence of each CDS?
thanks, Martin