bioinf-benchmarking / mapping-benchmarking

Snakemake pipeline for benchmarking read mappers

Some benchmark remarks #1

Open ksahlin opened 1 year ago

ksahlin commented 1 year ago

Hi @ivargr,

First off, this is a great effort. Thank you for this resource!

I was a bit hesitant to advertise it on Twitter yet, because I found at least one potential issue with your draft benchmarking results.

  1. I think the rounding of r in the strobealign rule is wrong for some read lengths, which might lead to suboptimal results, at least for read length 75. Strobealign estimates the read length (and canonical read length) itself, so I think simply "strobealign -t {wildcards.n_threads} --use-index {input.ref} {input.reads} | " should suffice. Strobealign will recognize the index (@marcelm - correct me if I am wrong); e.g., for 75nt reads it should be r50.sti, since the canonical read length will be 50. With your rounding, I think strobealign will use 100. In all cases, the indexing and mapping could use the unmodified {wildcards.r}.
  2. I am suspicious about the runtime results, which show that minimap2 and strobealign are slower than Bowtie2 and BWA-MEM. I have a difficult time imagining this happening unless there are very few reads, so that reading the index from file takes longer than mapping the reads. Reading the index from file might be 30 sec at most for strobealign and minimap2. I see that the reads are mapped in gz format, but I don't think that should slow the mapping process down (@marcelm?).

I will take a further look as soon as I have time. Again, really nice work! I will definitely try your benchmark out.

ksahlin commented 1 year ago

Oh, another more likely explanation for point 2 is that there are too few reads, so minimap2 and strobealign only use some of the threads. Strobealign previously used a batch size of 100k reads per thread, but v0.8.0 uses 10k reads. Did you run v0.8.0? The fact that there is almost no drop in runtime for strobealign with an increasing number of cores looks suspicious.
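A minimal sketch of this reasoning (the batch numbers are from the comment above, but the scheduling model is my simplification, not strobealign's actual code): with a fixed batch of reads handed to each thread, a small dataset cannot keep all threads busy.

```python
import math

# Simplified model: if reads are handed out in fixed-size batches,
# at most ceil(n_reads / batch_size) threads ever get work.
def threads_with_work(n_reads, n_threads, batch_size=10_000):
    return min(n_threads, math.ceil(n_reads / batch_size))

# With the old 100k batch size, 1M reads could occupy at most 10 threads:
print(threads_with_work(1_000_000, 16, batch_size=100_000))  # 10
# With the v0.8.0 batch size of 10k, the same dataset can use all 16:
print(threads_with_work(1_000_000, 16, batch_size=10_000))   # 16
```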

ksahlin commented 1 year ago

(Also, in case convenient, read group can be added from command line with --rg-id in v0.8.0 https://github.com/ksahlin/strobealign/blob/main/CHANGES.md#features)

ivargr commented 1 year ago

Thanks for pointing this out!

I was a bit unsure about the implementation of strobealign, and was going to ask you about this. I should have done it before putting these results out (sorry for that!). I've just removed strobealign now from the automatic plots until I know that it's been implemented correctly.

The problem I faced was Snakemake-related, in that when just specifying --use-index the Snakemake rule does not know what index will be required (since Strobealign rounds the read length). So I just tried to "predict" what the read length would be (it seemed that Strobealign rounded it to the nearest 50 -- is that right?). I guess one could instead make the indexing part of the mapping rule. That would increase the measured run-time, but probably doesn't matter too much if there are enough reads.

And yes, regarding mapping time, from my general experience Minimap2 and Strobealign should be much faster than bwa mem, so I think that's because there are too few reads, as you point out. I'll update the config now with more reads.

ksahlin commented 1 year ago

> I was a bit unsure about the implementation of strobealign, and was going to ask you about this. I should have done it before putting these results out (sorry for that!). I've just removed strobealign now from the automatic plots until I know that it's been implemented correctly.

No worries! Really happy you put the work in.

> The problem I faced was Snakemake-related, in that when just specifying --use-index the Snakemake rule does not know what index will be required (since Strobealign rounds the read length). So I just tried to "predict" what the read length would be

I see, now I get why this is annoying.

> (it seemed that Strobealign rounded it to the nearest 50 -- is that right?)

It is not exactly that simple anymore; here is the current projection table.

I think this could work:

def strobealign_r(read_length):
    if int(read_length) <= 90:
        return 50
    elif int(read_length) <= 110:
        return 100
    elif int(read_length) <= 135:
        return 125
    elif int(read_length) <= 175:
        return 150
    elif int(read_length) <= 275:
        return 250
    elif int(read_length) <= 375:
        return 300
    else:
        return 400

ivargr commented 1 year ago

Thanks, I'll try that out!

Meanwhile, I've updated the config so that it now runs on 2 mill reads when checking runtime, I guess that starts to be enough reads to get a more correct picture:

[attached image: runtime results]

marcelm commented 1 year ago

> Reading the index from file might be 30 sec at most for strobealign and minimap2

Yes. On our cluster nodes (which are a couple of years old), for a 3 Gbp genome, I measured 20 seconds total: 5 seconds for the FASTA file plus 15 seconds for the index itself (if everything is in cache).

> [I] see that the reads are mapped in gz format, but I don't think that should slow the mapping process down (@marcelm?).

It increases runtime by about 2%, but I think it is more realistic to benchmark using compressed FASTQs as that is what we nearly always have in practice. There are different ways to do the decompression and if a read mapper chooses a slow decompression method, that will influence runtime in practice and should be reflected in benchmarks (IMO).

Of course all tools should be benchmarked on compressed input.

Regarding the .sti files: I don’t recommend re-implementing the strobealign rules for deriving the canonical read length, as this is an implementation detail that is subject to change. To solve the Snakemake-specific problem that you cannot easily write a rule where you don’t know the output file name, there used to be a feature called "dynamic outputs", but it was replaced a while ago with checkpoints. I’ve never used those and admit they look a bit complicated for such a basic thing.

One alternative is to let the strobealign rule not depend on the .sti file itself, but on a marker file that is created by the rule that creates the index. The name of the marker file should then include the name of the FASTQ input and not the read length. Roughly like this:

rule create_strobealign_index:
  output: marker="{ref}.{sample}.strobealign-index-created"
  input: fasta="{ref}.fasta", fastq="{sample}.fastq.gz"
  shell: "strobealign --create-index {input.fasta} {input.fastq} && touch {output.marker}"

rule map_with_strobealign:
  output: ...
  input: marker="{ref}.{sample}.strobealign-index-created"
...

Another comment: I noticed you use ... | samtools view -b -h - > out.bam. The -h is implied for -b (BAM output), so you can omit it. Actually, the more "modern" style would be to use -o and let samtools autodetect the format: ... | samtools view -o out.bam. (And if you write anything that is used in production, just pipe directly into samtools sort instead of samtools view, which is a lot faster than sorting in a separate step.)
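The samtools suggestion above could be wired up roughly like this; the Popen plumbing is a generic sketch with placeholder tool arguments and file names, not the pipeline's actual rule:

```python
import subprocess

# Sketch: build the argument lists for "mapper | samtools sort -o out.bam",
# so sorting happens in the pipe instead of as a separate step afterwards.
def sort_pipe_cmds(mapper_cmd, out_bam, threads=4):
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return mapper_cmd, sort_cmd

mapper_cmd, sort_cmd = sort_pipe_cmds(
    ["bwa", "mem", "-t", "4", "ref.fasta", "reads.fastq.gz"], "out.bam"
)
# Running it would look like this (requires bwa/samtools installed):
# p1 = subprocess.Popen(mapper_cmd, stdout=subprocess.PIPE)
# subprocess.run(sort_cmd, stdin=p1.stdout, check=True)
print(" ".join(sort_cmd))  # samtools sort -@ 4 -o out.bam -
```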

ksahlin commented 1 year ago

> Meanwhile, I've updated the config so that it now runs on 2 mill reads when checking runtime, I guess that starts to be enough reads to get a more correct picture

I think it looks better! However, to really see the time differences and scaling with threads, I would use enough reads that all methods have the chance to show their (near-halved) runtime when doubling the number of threads. For example, strobealign seems to finish in almost no time regardless of 4, 8 or 16 cores, so it looks as if the number of cores does not matter beyond 4 (which should not be the case).

Edit: Or in other words, use enough reads that reading the index is a negligible share of the total time. (I understand that working with larger datasets complicates the pipeline, though.)

ivargr commented 1 year ago

> I think it looks better! However, to really see the time differences and scaling with threads, I would use enough reads that all methods have the chance to show their (near-halved) runtime when doubling the number of threads.

Agree, I've rerun it with 10 mill reads and think maybe that should be enough (no problem increasing it further, it just takes more time to run). It's a bit difficult to visually see the runtime differences for high numbers of threads in the current plots, but it is clear from the raw numbers in the results table that more threads reduce the runtime. In the future, I think it would be nice to use plotly HTML plots that support zooming, which would make the differences much easier to spot.

> Regarding the .sti files: I don’t recommend re-implementing the strobealign rules for deriving the canonical read length as this is an implementation detail that is subject to change.

Really agree on this, and the marker file is a very good idea. I think I've made that work now (80091f68ee1eb2076d62648434d7fe0b0d0e5b7d). Thanks for the samtools tips, btw!

> Also, in case convenient, read group can be added from command line with --rg-id in v0.8.0

That's very nice. But I think for some reason GATK also needs the SM-tag in the header, and --rg-id only adds ID:.... What I specify with BWA-MEM to get this to work is -R "@RG\\tID:sample\\tSM:sample".

Regarding strobealign accuracy, I think the accuracy increased slightly when getting the index -r "correct". The latest generated plots (accuracy vs read length) now look quite similar to those in the Strobealign manuscript (i.e. slightly worse accuracy than bwa-mem for low read lengths and good accuracy for higher read lengths):

[attached plot: accuracy vs read length]

On the first plot that you commented on, accuracy was shown for various MAPQ scores, and this also made strobealign look a bit worse, since it performs worse at high MAPQs (which dominated the plots visually) and better, more similar to the other mappers, at low MAPQs (i.e. with all reads included, which was more difficult to see in the plot). On a side note, I don't know how MAPQ is calculated in strobealign, but it seems that strobealign might not compute MAPQ scores as accurately as the other aligners and thus comes out worse at high MAPQs (maybe that is something one could look into).

Anyway, as you see I've added Strobealign to the automatic runs again, and also added a disclaimer that this pipeline is under development and that results may be wrong and should not be trusted. But I think the plots now look somewhat reasonable, and I'll continue adding more cases/settings. Hopefully some of these results will be useful in pinpointing where potential improvements to the methods can be made.

Thanks for all the help so far, really appreciate it!

marcelm commented 1 year ago

> But I think for some reason GATK also needs the SM-tag in the header, and --rg-id only adds ID:....

Yeah, you also need --rg SM:samplename to add a sample name. (This is copied from how Bowtie2 does it.)
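For reference, the equivalent read-group arguments could be assembled like this (the sample name is a placeholder; double-check the flags against your installed versions):

```python
# Read-group arguments carrying both ID and SM tags, for GATK compatibility.
sample = "sample1"

# BWA-MEM: a single -R header string; bwa expands the literal "\t" itself.
bwa_rg = ["-R", f"@RG\\tID:{sample}\\tSM:{sample}"]

# strobealign >= 0.8.0: --rg-id for the ID, plus --rg for extra tags like SM.
strobealign_rg = ["--rg-id", sample, "--rg", f"SM:{sample}"]

print(bwa_rg[1])                  # @RG\tID:sample1\tSM:sample1
print(" ".join(strobealign_rg))   # --rg-id sample1 --rg SM:sample1
```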

ksahlin commented 1 year ago

> In the future, I think it would be nice to use plotly HTML plots that support zooming, which would make the differences much easier to spot.

Yes, or alternatively log-scale on the y-axis.

> On the first plot that you commented on, accuracy was shown for various MAPQ scores, and this also made strobealign look a bit worse, since it performs worse at high MAPQs (which dominated the plots visually) and better, more similar to the other mappers, at low MAPQs (i.e. with all reads included, which was more difficult to see in the plot). On a side note, I don't know how MAPQ is calculated in strobealign, but it seems that strobealign might not compute MAPQ scores as accurately as the other aligners and thus comes out worse at high MAPQs (maybe that is something one could look into).

I see, that makes sense. I also have a pretty good feeling that strobealign can improve its MAPQ score values. Something we will look into. Thanks for providing us with the analysis!

ksahlin commented 1 year ago

Sorry for the million questions, but does the plot you attached in your comment show the F1 score separately for reads aligning over genome sites with variants (right column) and without variants (center column)? If so, do the variants consist of indels only, or do they also include SNPs?

ivargr commented 1 year ago

The variants are both SNPs and indels, i.e. all variants from the VCF used to simulate the reads (in this case the HG002 GIAB VCF). So these reads are simulated to contain a genetic variant. I think it would be cool to have separate plots for SNPs/indels and also a plot showing accuracy for various indel lengths (maybe the total length of all variants within a read).
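A sketch of that idea, bucketing reads by the total length of the simulated variants they contain; the bucket thresholds and the (variant_length, mapped_correctly) input format are hypothetical, just to show the shape:

```python
from collections import defaultdict

# Bucket reads by total simulated-variant length, then compute per-bucket
# mapping accuracy. Thresholds are arbitrary illustration values.
def variant_bucket(total_variant_len):
    if total_variant_len == 0:
        return "no variant"
    if total_variant_len <= 5:
        return "1-5 bp"
    if total_variant_len <= 20:
        return "6-20 bp"
    return ">20 bp"

def accuracy_by_bucket(reads):
    # reads: iterable of (total_variant_len, mapped_correctly) pairs
    stats = defaultdict(lambda: [0, 0])  # bucket -> [n_correct, n_total]
    for var_len, correct in reads:
        b = variant_bucket(var_len)
        stats[b][0] += int(correct)
        stats[b][1] += 1
    return {b: n_correct / n_total for b, (n_correct, n_total) in stats.items()}

print(accuracy_by_bucket([(0, True), (0, True), (3, False), (30, True)]))
```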

ksahlin commented 1 year ago

> Agree, I've rerun it with 10 mill reads and think maybe that should be enough (no problem increasing it further, it just takes more time to run). It's a bit difficult to visually see the runtime differences for high numbers of threads in the current plots, but it is clear from the raw numbers in the results table that more threads reduce the runtime. In the future, I think it would be nice to use plotly HTML plots that support zooming, which would make the differences much easier to spot.

Another solution would be to plot "aligned reads per second" on the y-axis (i.e., divide the number of reads by the alignment time). This would not require scaling up the dataset as much, and the differences may be more visible in the plot.
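In code the normalization is trivial; the runtimes below are made-up numbers, only to show the shape of the data:

```python
# Convert (threads -> wall-clock seconds) into aligned reads per second,
# a throughput measure that stays comparable across dataset sizes.
def reads_per_second(n_reads, wall_seconds):
    return n_reads / wall_seconds

n_reads = 10_000_000
runtimes = {4: 120.0, 8: 65.0, 16: 40.0}  # hypothetical measurements
throughput = {t: reads_per_second(n_reads, s) for t, s in runtimes.items()}
print(throughput[16])  # 250000.0
```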