ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Re-build seq/cov5 99% nt redundancy set to prioritize RefSeqs #204

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

With apologies to @rchikhi and @taltman, I overlooked a step in constructing the OTUs. We need to reconstruct the 99% redundant set of full-length genomes and their annotations to ensure that if a RefSeq genome is in the cluster then it becomes the exemplar sequence (Artem's term) / centroid sequence (my term).

@rcedgar: I will build the RefSeq-prioritized set of complete genomes and fragments and update the 99.fa and 99.uc files in seq/cov5.

@rchikhi and @taltman: re-run annotations on the updated genomes.
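
For readers following along, the RefSeq prioritization described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for seq/cov5. It assumes the clusters come from a USEARCH-style tab-separated .uc file (record type in column 1, cluster number in column 2, sequence label in column 9) and that a RefSeq record can be recognized by an accession of the form two uppercase letters, an underscore, then digits (e.g. NC_045512.2). The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: RefSeq accessions start with two uppercase letters and an
# underscore followed by digits (e.g. NC_045512.2); GenBank ones do not.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+")


def is_refseq(label):
    """Return True if the sequence label looks like a RefSeq accession."""
    return bool(REFSEQ_RE.match(label))


def parse_uc(lines):
    """Collect the original centroid and all member labels per cluster.

    Only S (centroid) and H (hit) records carry members; C records are
    per-cluster summaries and are skipped.
    """
    centroids = {}
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        rec_type, cluster, label = fields[0], int(fields[1]), fields[8]
        if rec_type == "S":
            centroids[cluster] = label
            clusters[cluster].append(label)
        elif rec_type == "H":
            clusters[cluster].append(label)
    return centroids, clusters


def prioritize_refseq(centroids, clusters):
    """Pick a RefSeq member as exemplar when a cluster contains one,
    otherwise keep the original centroid."""
    chosen = {}
    for cluster, members in clusters.items():
        refseqs = [m for m in members if is_refseq(m)]
        chosen[cluster] = refseqs[0] if refseqs else centroids[cluster]
    return chosen
```

One caveat with this shortcut: swapping the exemplar after clustering does not guarantee that every member is still within 99% identity of the new exemplar, so re-clustering with RefSeq sequences ordered first may be the safer route.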

ababaian commented 3 years ago

Don't we have annotations for all GenBank sequences by this point?

rcedgar commented 3 years ago

It's my understanding we chose to annotate only the 99% set to avoid a huge amount of redundancy, e.g. thousands of Cov2 genomes which are all the same from an annotation validation and OTU construction perspective.

rcedgar commented 3 years ago

@rchikhi @taltman RefSeq-prioritized files below are posted to s3://serratus-public/seq/cov5/. I need PFAM taxonomic domain alignments for sequences in both files. Thanks!

cov5_cg_id99.fa
toro5_cg_id99.fa

rchikhi commented 3 years ago

Done. https://s3.console.aws.amazon.com/s3/buckets/serratus-public/seq/cov5/annotations/?region=us-east-1&tab=overview Annotations are still missing for about 18 of them, the same number as yesterday (like the VADR bug that @taltman reported to the devs). I've also deleted yesterday's annotations that were in serratus-public/assemblies/annotations to avoid confusion.

taltman commented 3 years ago

@rchikhi, is there an easy way for me to see the per-AWS Batch instance stderr and stdout? Sifting through the combined stream on Cloud Front is not easy...

I basically want to categorize the reason for the 18 failing. Was it all the same VADR bug? Or something different for most of them?

taltman commented 3 years ago

@rchikhi In the meantime, could you update #203 so that it shows the identity of the 18 that are still failing? Then I could run hmmsearch directly on them in the name of expediency.
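
Once the failing identifiers are known, pulling just those records out of the FASTA file is straightforward. A minimal sketch, assuming the identifier is the first whitespace-delimited token of each header line; the function names are hypothetical, and the failing IDs would have to come from #203:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(lines, wanted_ids):
    """Yield only the records whose accession is in wanted_ids.

    Assumption: the accession is the first whitespace-delimited token
    of the header line.
    """
    wanted = set(wanted_ids)
    for header, seq in read_fasta(lines):
        if header.split()[0] in wanted:
            yield header, seq
```

The resulting records could then be written to a small FASTA file and handed to whichever annotation tool is being debugged.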

rchikhi commented 3 years ago

> @rchikhi In the meantime, could you update #203 so that it shows the identity of the 18 that are still failing? Then I could run hmmsearch directly on them in the name of expediency.

To confirm: has this already been answered by Robert here https://github.com/ababaian/serratus/issues/203#issuecomment-660241894 or is it something different?

taltman commented 3 years ago

I think that we're discussing the same set of 18, as in #203. As I updated there, I believe all but two entries worked, so @rcedgar should be free to proceed. Reassigning to @rcedgar to either clarify the work to be done, or to confirm the fix and close the issue.

rcedgar commented 3 years ago

Issue superseded by #211.