ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
cov3r pan-genome refinements #76

Closed ababaian closed 4 years ago

ababaian commented 4 years ago

Starting with cov2r

Remove simple repeat sequences found by dusting

Command for this was discussed previously by Robert/Tomer, find it.

Additional blacklist entries

From Robert. Reasons?

rcedgar commented 4 years ago

Dusting command is in ncbi-blast+ package, example command:

dustmasker -in mega_hv_covu.fa  \
  -outfmt fasta \
  -out mega_hv_covu_softmasked.fa

I don't think it supports hard-masking, and I do think bowtie2 ignores soft-masking. If correct, you need to post-process the dust-masked file to convert lower-case to Ns.
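The post-processing step could look roughly like the sketch below. It is not code from the serratus repository; the function name `hard_mask_fasta` is made up for illustration, and it assumes the reference is entirely upper-case before dusting, so that any lower-case letter in a sequence line marks a dust-masked base.

```python
def hard_mask_fasta(lines):
    """Yield FASTA lines with soft-masked (lower-case) bases replaced by N.

    Assumption: the only lower-case letters in sequence lines are the ones
    written by the masker. Header lines (starting with '>') are left as-is.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header: pass through untouched so accessions stay intact.
            yield line
        else:
            # Sequence: lower-case means masked, so turn it into N.
            yield "".join("N" if ch.islower() else ch for ch in line)


# Tiny in-memory example; a real run would iterate over an open file handle.
soft = [">seq1 example", "ACGTacgtACGT", "ttttttGGCC"]
hard = list(hard_mask_fasta(soft))
print("\n".join(hard))
```

If the reference already contains lower-case bases before dusting, those would be converted to N as well, so it may be worth upper-casing the input first.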

ababaian commented 4 years ago

Ideally this can all be done via a bed file that holds the coordinates of all masked positions. Then we have one hard accession blacklist, applied early, and one blacklist for regions, stored as a bed file. The regions are soft-masked to lowercase in cov3.fa and hard-masked in cov3.mask.fa, prior to generating reverse sequence controls.
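A minimal sketch of the bed-driven idea, under several assumptions: the bed file uses standard 0-based, half-open `chrom<TAB>start<TAB>end` intervals, the first column matches the FASTA accession, and sequences are held in memory as a dict. The function name `apply_bed_mask` and the accession `acc1` are hypothetical, and sequences are upper-cased before masking so that lower-case only ever means "masked".

```python
def apply_bed_mask(seqs, bed_lines, hard=False):
    """Return a copy of seqs with the BED intervals masked.

    seqs      : dict mapping accession -> sequence string
    bed_lines : iterable of 'chrom<TAB>start<TAB>end' lines (0-based, half-open)
    hard      : False -> soft-mask to lower-case, True -> hard-mask to N
    """
    # Work on upper-cased mutable copies so lower-case only marks masked bases.
    masked = {name: list(seq.upper()) for name, seq in seqs.items()}
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split("\t")[:3]
        if chrom not in masked:
            continue  # interval refers to an accession we do not have
        bases = masked[chrom]
        # Clamp the end so an over-long interval cannot run off the sequence.
        for i in range(int(start), min(int(end), len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return {name: "".join(bases) for name, bases in masked.items()}


seqs = {"acc1": "ACGTACGTAC"}
bed = ["acc1\t2\t6"]
soft_masked = apply_bed_mask(seqs, bed)             # analogous to cov3.fa
hard_masked = apply_bed_mask(seqs, bed, hard=True)  # analogous to cov3.mask.fa
print(soft_masked["acc1"], hard_masked["acc1"])
```

Driving both outputs from the same intervals keeps the soft- and hard-masked references in sync by construction.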

rcedgar commented 4 years ago

For low-complexity masking I don't see the value of making a bed file. It is simpler just to run dustmasker every time the reference is re-generated.

rcedgar commented 4 years ago

False positives:

AX191449.1 is an exact nt match to AJ295749.2 Rattus norvegicus mRNA for xylosyltransferase II.
AX191447.1 is an exact nt match to AJ295748.1 Rattus norvegicus mRNA for xylosyltransferase I.
HV449436.1 is an exact nt match to NM_022255.1 Rattus norvegicus G-protein coupled receptor 173.

rcedgar commented 4 years ago

FYI, I found the false positives and the true-positive pig CoVs by sorting the summaries in order of decreasing detection score:

grep score *.summary | sed -e 's/\.summary:/ /' -e 's/score=//' -e 's/;//' | sort -rnk2 | less

The pigs all scored 100. Other datasets had high scores due to those three accessions, and in those datasets the rest of the alignments looked like junk. This prompted me to look more closely at the nt sequences. This shows the value of a summarizer-like tool for getting a very quick sense of what is in the reads.
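For anyone who prefers a script to the pipeline, the same ranking could be sketched as below. The summary file format is not shown in this thread, so this assumes only what the sed expressions imply: each summary contains a `score=<integer>` field. The function name `rank_by_score` and the dataset names `runA`, `runB`, `runC` are placeholders, not real accessions.

```python
import re

# Assumption: the detection score appears as 'score=<integer>' in each summary.
SCORE_RE = re.compile(r"score=(\d+)")


def rank_by_score(summaries):
    """Return [(name, score), ...] sorted by decreasing score.

    summaries: dict mapping a dataset name to the text of its summary.
    Summaries without a score field are skipped.
    """
    ranked = []
    for name, text in summaries.items():
        match = SCORE_RE.search(text)
        if match:
            ranked.append((name, int(match.group(1))))
    # Highest score first, mirroring 'sort -rnk2'.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Placeholder inputs; real use would read the *.summary files from disk.
example = {"runB": "score=37;", "runA": "score=100;", "runC": "no hits"}
ranking = rank_by_score(example)
print(ranking)
```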

rcedgar commented 4 years ago

@ababaian can we drop the reverse sequence controls? Not clear to me that they provide any help with FPs. Simpler without them.

ababaian commented 4 years ago

Closing this issue as there was an earlier version in #64 which still wasn't resolved.