bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0

Read alignment SAM file is not sorted correctly prior to score filtering #88

Open ppericard opened 4 years ago

ppericard commented 4 years ago

The alignment filtering step should sort the SAM file by read id and alignment score before filtering. Right now the sorting is performed by the following command:

2019-07-29 13:47:23,161 - root - DEBUG - CMD: cat /media/data/test_matam/matam_dev_assembly/workdir/16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.sam | grep -v "^@" | sort -T /media/data/test_matam/matam_dev_assembly/workdir -S 10000M --parallel 6 -k 1,1V -k 12,12nr | /media/data/matam_dev/scripts/filter_score_multialign.py -t 0.9 --geometric > /media/data/test_matam/matam_dev_assembly/workdir/16sp.art_HS25_pe_100bp_50x.sortmerna_vs_SILVA_128_SSURef_NR95_b10_m10.scr_filt_geo_90pct.sam

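However, -k 12,12nr doesn't work correctly on fields like AS:i:195: a numeric comparison only reads a leading number, and since the field starts with the non-numeric prefix AS:i:, every key evaluates to 0. As a result, the alignments are not actually ordered by score and the resulting file is not correctly sorted (at least on Ubuntu 18.04).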
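To illustrate why the numeric key has no effect, here is a rough Python model of how a numeric sort key is read. It is only an approximation written for this issue (the name `numeric_key` is hypothetical, and thousands separators and locale-specific decimal points are ignored), not the actual sort implementation:

```python
import re

# Rough model of a numeric sort key: optional blanks, an optional '-',
# digits, and an optional fractional part. Anything that does not start
# with such a number is treated as 0.
_NUMERIC_PREFIX = re.compile(r"\s*(-?\d+(?:\.\d*)?)")


def numeric_key(field):
    """Return the leading number of `field`, or 0.0 if there is none."""
    match = _NUMERIC_PREFIX.match(field)
    return float(match.group(1)) if match else 0.0


# Optional fields as they appear in column 12 of the SAM file.
fields = ["AS:i:195", "AS:i:87", "AS:i:200"]
keys = [numeric_key(f) for f in fields]
```

Under this model all three keys are equal, so the score column cannot influence the order of the lines.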

To sort correctly, we should be using a version sort on that field: -k 12,12Vr. However, I suspect this won't work with all versions of sort, and it might also depend on the locale variables.
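One possible way to avoid depending on the sort version or the locale would be to do the ordering in Python before filtering. The following is only a minimal sketch, not what matam currently does: the function names `alignment_score` and `sort_alignments` are hypothetical, it loads all alignments into memory, and it assumes that grouping lines by read id with a plain string comparison is enough for the filtering step (this is not the same order as -k 1,1V).

```python
def alignment_score(sam_line):
    """Return the integer value of the AS:i: tag of one SAM alignment line."""
    # The first 11 columns are mandatory; optional tags start at column 12.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    raise ValueError("no AS:i: tag in alignment line")


def sort_alignments(sam_lines):
    """Drop header lines, then sort by read id and by decreasing score."""
    records = [line for line in sam_lines if not line.startswith("@")]
    return sorted(
        records,
        key=lambda line: (line.split("\t", 1)[0], -alignment_score(line)),
    )


# Made-up alignment lines, for illustration only.
example = [
    "@HD\tVN:1.0",
    "r2\t0\tref1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:87",
    "r1\t0\tref1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:87",
    "r1\t0\tref2\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:195",
    "r2\t0\tref2\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:200",
]
ordered = sort_alignments(example)
```

Because the score is parsed from the tag itself, this sketch does not rely on AS:i: being in column 12, but for large SAM files an external sort may still be needed.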