christophertbrown / iRep

scripts for estimating bacteria replication rates based on population genome copy number variation
MIT License
68 stars 9 forks source link

Different result if I use sorted SAM file as an input or use the unsorted SAM file as an input #31

Closed shashankx closed 4 years ago

shashankx commented 4 years ago

Hi, I am trying to use the iRep, I have first used the unsorted SAM file and use the parameter- iRep -f Refined_bins/* -s sam/18097D-02-01.sam --sort -t 10 -o 18097D-02-01_ouput1

and then I used the sorted SAM file - iRep -f Refined_bins/* -s sam/18097D-02-01_sorted.sam -t 10 -o 18097D-02-01_ouput2

In both the cases I got the output as expected, but slightly different results-

(qiime2-2019.7) [s176365@computerome01 GrowthDynamics_Project]$ head *.tsv

==> 18097D-02-01_output1.tsv <==

index of replication (iRep) - thresholds: min cov. = 5, min wins. = 0.98, min r^2 = 0.9, max fragments/Mbp = 175, GC correction min r^2 = 0.0

genome sam/18097D-02-01.sam

Refined_bins/metawrap_50_10_bins/bin.10.fa 1.8196392224225502 Refined_bins/metawrap_50_10_bins/bin.11.fa 1.2924720480420955 Refined_bins/metawrap_50_10_bins/bin.12.fa 1.8852428770409324 Refined_bins/metawrap_50_10_bins/bin.13.fa 1.282351579973846 Refined_bins/metawrap_50_10_bins/bin.14.fa n/a Refined_bins/metawrap_50_10_bins/bin.15.fa 1.336003763633216 Refined_bins/metawrap_50_10_bins/bin.16.fa 1.6011765116142753

==> 18097D-02-01_output2.tsv <==

index of replication (iRep) - thresholds: min cov. = 5, min wins. = 0.98, min r^2 = 0.9, max fragments/Mbp = 175, GC correction min r^2 = 0.0

genome sam/18097D-02-01.sam.sorted.sam

Refined_bins/metawrap_50_10_bins/bin.10.fa 1.8813054923644446 Refined_bins/metawrap_50_10_bins/bin.11.fa 1.298804163590093 Refined_bins/metawrap_50_10_bins/bin.12.fa 1.9381343015530244 Refined_bins/metawrap_50_10_bins/bin.13.fa 1.2960729300999423 Refined_bins/metawrap_50_10_bins/bin.14.fa n/a Refined_bins/metawrap_50_10_bins/bin.15.fa 1.3425165432843467 Refined_bins/metawrap_50_10_bins/bin.16.fa 1.6262016201595502 Refined_bins/metawrap_50_10_bins/bin.17.fa 1.3584872237631114

Which way is correct? Any suggestion?

nigiord commented 4 years ago

iRep uses its own custom-made parser to parse the SAM file, and it explicitly expects reads to be sorted by name.

From the README:

If the SAM file is not sorted, be sure to either re-run Bowtie with the --reorder flag, or use the --sort option to sort the SAM file; otherwise, the scripts will make incorrect choices about which mapped reads to include when calculating coverage.

Make also sure that reads from the same pair are always present in the SAM file, even if the second read was not mapped (see this discussion). This is due to the peculiar way the parser works (always expecting read n+1 to be paired with read n, so the "reading frame" should not be shifted).

−Nils

shashankx commented 4 years ago

Hi Nils, I have used the following command to sort the SAM file-

for f in *.sam
do
samtools sort -@ 15 $f -O sam -o $f.sorted.sam
done

I suppose this will sort the samples.

nigiord commented 4 years ago

By default, samtools sorts by position on the reference ([documentation] (http://www.htslib.org/doc/samtools.html)) which is not what iRep expects. To sort by read name, you need to add the -n option.

−Nils

christophertbrown commented 4 years ago

Thanks, Nils for your earlier responses.

I agree with you that samtools sort with -n should provide the same result. The only reason for a difference that I can think of is that there is something going on with orphaned paired reads (as we have discussed before, and which you referenced).

Shanky - if you are still in doubt, I would recommend re-mapping using bowtie2 and the —reorder flag to make sure that the sam file is in the correct format, and see how that compares with your other results.

Chris

On Sep 23, 2019, at 9:17 AM, Shanky notifications@github.com wrote:  I did sorting using -n parameters, then ran irep and got the same results as observed previously.

samtools sort -@ 20 sam/18097D-02-01.sam -O sam -o sam/18097D-02-01.soted.sam iRep -f Refined_Bins/fastas/18097D-02-01_bin.* -s sam/18097D-02-01.soted.sam -t 15 -o irep_test/18097D-02-01_NEW_With_SORTED

So, I am slightly confused now, should I use the one which I sorted using -n parameter and then perform iRep or should I use the iRep its own sorting using --sort option.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.