Closed rcedgar closed 3 years ago
Don't we have annotations for all GenBank sequences by this point?
It's my understanding we chose to annotate only the 99% set to avoid a huge amount of redundancy, e.g. thousands of Cov2 genomes which are all the same from an annotation validation and OTU construction perspective.
@rchikhi @taltman RefSeq-prioritized files below are posted to s3://serratus-public/seq/cov5/. I need PFAM taxonomic domain alignments for sequences in both files. Thanks!
cov5_cg_id99.fa
toro5_cg_id99.fa
done.
https://s3.console.aws.amazon.com/s3/buckets/serratus-public/seq/cov5/annotations/?region=us-east-1&tab=overview
Some of the annotations are missing (around 18, the same as yesterday, like the VADR bug that @taltman reported to the devs).
Also, I've deleted yesterday's annotations that were in serratus-public/assemblies/annotations to avoid confusion.
@rchikhi, is there an easy way for me to see the per-AWS Batch instance stderr and stdout? Sifting through the combined stream on Cloud Front is not easy...
I basically want to categorize the reasons the 18 are failing. Was it all the same VADR bug? Or something different for most of them?
@rchikhi In the meantime, could you update #203 so that it shows the identity of the 18 that are still failing? Then I could run hmmsearch directly on them in the name of expediency.
To confirm: has it already been answered by Robert here https://github.com/ababaian/serratus/issues/203#issuecomment-660241894 or is it something different?
I think that we're discussing the same set of 18, as in #203. As I updated there, I believe all but two entries worked, so @rcedgar should be free to proceed. Reassigning to @rcedgar to either clarify the work to be done, or to confirm the fix and close the issue.
Issue superseded by #211.
With apologies to @rchikhi and @taltman, I overlooked a step in constructing the OTUs. We need to reconstruct the 99% redundant set of full-length genomes and their annotations to ensure that if a RefSeq genome is in the cluster then it becomes the exemplar sequence (Artem's term) / centroid sequence (my term).
@rcedgar: I will build the RefSeq-prioritized set of complete genomes and fragments and update the 99.fa and 99.uc files in seq/cov5.
@rchikhi and @taltman: re-run annotations on the updated genomes.
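
For anyone following along, here is a rough sketch of the RefSeq-prioritization idea described above. It is not the procedure actually used to build the files in seq/cov5. It assumes the .uc file follows the usual tab-separated USEARCH cluster format (record type in field 1, cluster number in field 2, query label in field 9), and that RefSeq records can be recognized by an accession prefix such as `NC_`. The function names `is_refseq` and `pick_exemplars` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: RefSeq accessions start with two capital letters and an
# underscore (e.g. NC_045512.2); GenBank accessions have no underscore.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+")


def is_refseq(label):
    """Return True if a sequence label looks like a RefSeq accession."""
    return bool(REFSEQ_RE.match(label))


def pick_exemplars(uc_lines):
    """Map each cluster number to an exemplar label, preferring RefSeq.

    uc_lines is any iterable of lines from a .uc file. If a cluster holds
    at least one RefSeq member, the first one seen becomes the exemplar;
    otherwise the original centroid is kept.
    """
    centroid = {}
    members = defaultdict(list)
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        rec_type, cluster, label = fields[0], int(fields[1]), fields[8]
        if rec_type == "S":
            # S record: this sequence founded the cluster (its centroid)
            centroid[cluster] = label
            members[cluster].append(label)
        elif rec_type == "H":
            # H record: this sequence was assigned to an existing cluster
            members[cluster].append(label)
        # C records only summarize clusters, so they are ignored here

    exemplars = {}
    for cluster, labels in members.items():
        refseq = [name for name in labels if is_refseq(name)]
        exemplars[cluster] = refseq[0] if refseq else centroid[cluster]
    return exemplars
```

Calling `pick_exemplars(open(path))` on a .uc file returns a dict from cluster number to the chosen label. Note that this only relabels exemplars: it does not guarantee that every member is within 99% identity of the new exemplar, so re-clustering may still be needed.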