Possible changes to fastq input

johnlees commented 1 year ago

Two possible additions

An equivalent to the `-m` option in ska1:

Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option.

We can filter on ambiguous, but not frequency of ambig.

Early stop/fast mode

This could probably made to go much faster by:

Stopping reading the input after n reads (set by e.g. 15x coverage)
For 'online' or ref-based analyses, only including k-mers in the ref and turning off filters (may need minimiser based bins to improve caching).

For now, the first of these could be a useful addition

johnlees commented 1 year ago

Another thought, a fastcall mode. Give a ref (.fa or .skf) and call SNPs against it given either a coverage or CPU time threshold.

As there will be no error correction makes sense to have a counter for each of four middle bases and then do a simple model to generate the call (e.g. [1, 0, 0, 5] -> G; [1, 3, 4, 0] -> Y. Probably looks like the bcftools model, feels multinomial to me). Or just keep the BBF to filter singletons (include middle base in hash)

johnlees commented 1 year ago

Important point from @rderelle from his work on fastlin: many fastq files in SRA come from BAMs, so are sorted and you cannot take the first n reads as a random subsample. Checking for subsequent reads having the same minimiser would probably catch this in most cases.

rderelle commented 1 year ago

I might have a suggestion regarding the fast mode (codename: 'flyover'):

(1) to parse the first fastq file and only extract kmers from 1 read every 20 reads. (2) at the end of this first pass, if the average coverage of the split-kmers is above the cutoff (e.g. 80X), then SKA can end the analysis ... the runtime has been roughly divided by 20. (3) if not, SKA could calculate how many more 20th are needed to reach the coverage cutoff (if 2 fastq files, SKA could reasonably assume that the number of reads in both files are the same) -> second pass with for instance extracting kmers from the 2nd, 3rd and 4th reads every 20 reads if 15% more reads are needed (in this case the runtime has been roughly divided by 5)

This approach would be able to analyse all types of fastq files (sorted or not). However, it would only make sense if the algorithm used to parse the fastq files is ultra-fast as it would require 2 successive parsing in most cases (e.g., 'seq_io' as in fastlin).

Edit: the examples mentioned above are probably not realistic (samples with very high coverages) since this approach should only divide the runtimes by a factor 3-4 at the most (depending on the cutoff).

johnlees commented 1 year ago

Thanks for the suggestion! Here's some profiling I did with fastq input (with the older countmin filter, which is a bit slower than the current implementation): The bloom filter takes around ~75% of the time when reading a file in, file reading is about 8% of the time (mostly spent on decompression with .fq.gz input, but when I've tested previously decompressing doesn't usually make sense because the disk access takes proportionally longer), the k-mer processing is around 7% of the time, dictionary adds are minimal.

So the reading is actually very fast if we skip the parsing, probably around 10-15x faster than current run, which might make this viable.

Finally, I think the best solution would be to check for sorted reads with minimisers, in which case you can just do one run, but if sorted then subsample

johnlees commented 1 year ago

Full profile if you are interested (open in firefox) flamegraph_cm_bb

rderelle commented 1 year ago

The minimisers approach seems simpler to implement but, for the sake of the discussion (I'm not advocating against it), it might suffer from two caveats: you might need to introduce another cutoff for the comparison of the minimisers and decide if reads are sorted or not (not entirely sure about this point as I have never used minimisers myself) if SKA concludes that reads are sorted, it would not be able to uniformly subsample along the fastq files without knowing a priori how many reads there are.

May I ask what profiling program did you use to generate these profiles? :)

johnlees commented 1 year ago

May I ask what profiling program did you use to generate these profiles? :)

It's flamegraph, which is really nice

johnlees commented 5 months ago

For mapping to a reference with reads, using a binary fuse filter for the ref knots could be a good choice. See https://docs.rs/xorf/latest/xorf/

bacpop / ska.rust