The dusting command is in the ncbi-blast+ package. Example command:
dustmasker -in mega_hv_covu.fa \
-outfmt fasta \
-out mega_hv_covu_softmasked.fa
I don't think it supports hard-masking, and I do think bowtie2 ignores soft-masking. If correct, you need to post-process the dust-masked file to convert lower-case to Ns.
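The post-processing step could look roughly like the sketch below. It assumes the dustmasker FASTA output marks masked bases in lower case and that every other base is upper case. The function name `hard_mask_fasta` is made up for illustration and is not part of any existing tool.

```python
def hard_mask_fasta(lines):
    """Yield FASTA lines with lower-case (soft-masked) bases replaced by N.

    Assumption: lower case is used only for masked positions.
    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            # Newlines and upper-case bases are not lower case, so they are kept.
            yield "".join("N" if c.islower() else c for c in line)


if __name__ == "__main__":
    import sys

    # Reads soft-masked FASTA on stdin, writes hard-masked FASTA on stdout.
    sys.stdout.writelines(hard_mask_fasta(sys.stdin))
```

Saved under a hypothetical name such as hard_mask.py, it would be run as `python hard_mask.py < mega_hv_covu_softmasked.fa > hard_masked.fa`, where the output file name is only a placeholder.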
Ideally this can all be done via a bed file which has coordinates for all masked positions in it. Then we have one hard accession black-list applied early, and one blacklist for regions, stored as a bed file. This is soft-masked to lowercase in cov3.fa and hard-masked in cov3.mask.fa, prior to generating reverse sequence controls.
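A minimal sketch of the bed-driven approach, assuming a plain three-column BED file (accession, 0-based start, exclusive end) keyed by the FASTA accession. The function names `parse_bed` and `mask_intervals` are hypothetical; the same region list gives the soft-masked sequence for cov3.fa and the hard-masked one for cov3.mask.fa.

```python
from collections import defaultdict


def parse_bed(lines):
    """Return {accession: [(start, end), ...]} from BED lines (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and BED track/browser header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        regions[fields[0]].append((int(fields[1]), int(fields[2])))
    return dict(regions)


def mask_intervals(seq, intervals, hard=False):
    """Soft-mask (lower case) or hard-mask (N) the given intervals of seq."""
    bases = list(seq)
    for start, end in intervals:
        # Clamp to the sequence length so an over-long interval cannot fail.
        for i in range(start, min(end, len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return "".join(bases)
```

For example, `mask_intervals("ACGTACGT", [(2, 5)])` soft-masks positions 2 to 4, and the same call with `hard=True` replaces them with N.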
For low-complexity masking I don't see the value of making a bed file. It is simpler just to run dustmasker every time the reference is re-generated.
False positives:
AX191449.1 is an exact nt match to AJ295749.2 Rattus norvegicus mRNA for xylosyltransferase II. AX191447.1 is an exact nt match to AJ295748.1 Rattus norvegicus mRNA for xylosyltransferase I. HV449436.1 is an exact nt match to NM_022255.1 Rattus norvegicus G-protein coupled receptor 173.
FYI, I found the false positives and the true-positive pig CoVs by sorting summaries in order of decreasing detection score:
grep score *.summary | sed "-es/.summary:/ /" | sed "-es/score=//" | sed "-es/;//" | sort -rnk2 | less
The pigs all scored 100. Other datasets had high scores due to those three accessions, and in those datasets the rest of the alignments looked like junk. This prompted me to look more closely at the nt sequences. This shows the value of a summarizer-like tool for getting a very quick sense of what is in the reads.
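The same ranking can be sketched in Python. This assumes only what the grep one-liner implies, namely that each summary file contains fields of the form `score=<number>;`. The rest of the summary format is not assumed, and the function names are invented for illustration.

```python
import re
from pathlib import Path

# Assumption: scores appear as "score=<number>", as in the grep/sed pipeline.
SCORE_RE = re.compile(r"score=(\d+(?:\.\d+)?)")


def extract_scores(text):
    """Return every score found in one summary file's text, as floats."""
    return [float(s) for s in SCORE_RE.findall(text)]


def rank_scores(named_texts):
    """Return (score, name) pairs sorted by decreasing score."""
    rows = [
        (score, name)
        for name, text in named_texts.items()
        for score in extract_scores(text)
    ]
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    # Read every *.summary file in the current directory and print the ranking.
    texts = {p.stem: p.read_text() for p in Path(".").glob("*.summary")}
    for score, name in rank_scores(texts):
        print(f"{name}\t{score:g}")
```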
@ababaian can we drop the reverse sequence controls? Not clear to me that they provide any help with FPs. Simpler without them.
Closing this issue as there was an earlier version in #64 which still wasn't resolved.
Starting with cov2r
Remove simple repeat sequences found by dusting
Command for this was discussed previously by Robert/Tomer, find it.
Additional blacklist entries
KC786228.1
which matches fungal rRNA and gives a very strong false-positive hit. See also #57.
From Robert. Reasons?