ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Discrepancy in counting RdRp hits in SRR11648360 #228

Closed taltman closed 3 years ago

taltman commented 4 years ago

See issue #223.

When @asl runs Pfam on SRR11648360.coronaspades.gene_clusters.fa, he gets hundreds of hits.

When I run on s3://serratus-rayan/master_table_assemblies/SRR11648360.fa, I do not get any hits.

@asl, can you please post an S3 URI to the file you analyzed, or just upload it to this issue?

@rchikhi, can you please confirm that I'm looking at the right assembly file?

rchikhi commented 4 years ago

s3://serratus-rayan/master_table_assemblies/SRR11648360.fa is the right assembly file; it has been filtered for CoV hits (using BGC). This assembly has only one small CoV hit. The CheckV-filtered assembly was empty.

In gene_clusters.fa (not filtered for CoV) there are many RdRP hits, but since they are not on CoV-filtered contigs, we discard them.

# Macro-domains - 122
# Peptidase_C30-domains - 2
# RdRP_1-domains - 1100
# Viral_helicase1-domains - 64
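
To illustrate why the two files give such different counts, here is a minimal Python sketch of counting domain hits with and without a contig filter. It is not the Serratus pipeline code: the function `count_domain_hits`, the `(contig, domain)` pair layout, and the toy contig names are all assumptions made for the example.

```python
from collections import Counter


def count_domain_hits(hits, keep_contigs=None):
    """Count hits per domain name.

    hits: iterable of (contig_id, domain_name) pairs (hypothetical layout).
    keep_contigs: optional set of contig IDs; hits on other contigs are dropped.
    """
    counts = Counter()
    for contig, domain in hits:
        # Skip hits on contigs that did not pass the filter.
        if keep_contigs is not None and contig not in keep_contigs:
            continue
        counts[domain] += 1
    return counts


# Toy data, not taken from SRR11648360.
hits = [
    ("contig_1", "RdRP_1"),
    ("contig_2", "RdRP_1"),
    ("contig_2", "Macro"),
    ("contig_3", "Viral_helicase1"),
]
cov_contigs = {"contig_3"}  # contigs assumed to have passed the CoV filter

unfiltered = count_domain_hits(hits)
filtered = count_domain_hits(hits, keep_contigs=cov_contigs)

print("unfiltered:", dict(unfiltered))
print("filtered:  ", dict(filtered))
```

In this toy case the unfiltered count includes RdRP_1 hits while the filtered count has none, which is the same pattern as the discrepancy discussed in this issue.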

So, bottom line: this dataset may contain CoV, but certainly not all CoV genes.