can genmap map run with very large values

splaisan commented 4 years ago

I would like to compute mappability of long reads on a repetitive plant genome to evaluate the efficiency of long reads (Pacbio or ONT) Can I use large values like K 1000 E 150 on a genome of 1GB or will this kill genmap or take ages and TB's to compute? Any advice for the 'index' command options to improve my later 'map' usage are welcome. Thanks for your advice

splaisan commented 4 years ago

I got -E >4 not yet supported I guess this is it for my usage idea as long reads have 10-15% intrinsic error rate and I wish to predict for kb kmer lengths; Or should I use 4 and get a perfect world result?

cpockrandt commented 4 years ago

Thanks for checking out GenMap! It was originally developed for short reads and up to 4 mismatches. I would like to continue the work and implement an algorithm for more mismatches (or even arbitrary error rates), but I am not sure whether there are applications for more mismatches without considering indels. Supporting indels is another point on my todo list but I don't think I can find time to implement it anytime soon and also the performance might drop significantly for indels.

If you want to know how mappable a read of 1'000 bp might be (by computing the mappability of 1'000 mers, although mapping and mappability are different concepts), an error rate lower than the error profile of the sequencing technology could be sufficient but (in your case) 0.4 while considering mismatches is probably not enough. Indexing a 1 GB genome and computing the mappability for 1'000 mers should still be quite fast, so you can just run it and see what the results look like. But I'm afraid a lot of regions of the genome will be "unique" (due to the low error rate).

splaisan commented 4 years ago

Thank you Christopher for the great tool and the support. I just managed to GenMap with kmer up to 10000 and error 4 and as expected the longer the kmers get the more unique positions in the genome. I want to use this info to motivate going long reads for this plant genome with 70% repeats and it comes clear than a length of 1000+ is much better than one of 100. I also will use the results to spot dangerous loci as usually done with mappability data. A pity we cannot estimate paired reads because the insert size is not constant as 150PE reads probably perform close to the total library footprint which is usually ~900bps. I now need to understand why the values obtained with your tool are not the same as those from gem mappability for kmers that the old tool can compute (100 to 1000). Maybe I should set your tool to -E 0 to be closer to gem? Best regards

cpockrandt commented 4 years ago

I don't have access to a binary of gem at the moment with a working mappability algorithm. Differences can occur due to the approximation of gem (you can turn it off explicitly to set the threshold parameter t (not sure about the argument name) to infinity (or at least very high - should be higher than the highest frequency of a k-mer in the genome).

splaisan commented 4 years ago

I am using an old version since the newer ones have lost the mappability feature and the author is not very active to re-introduce it.

I have use gem-mappability following an old report by Derrien et al and I documented my commands on our wiki few years ago (I added a link to your page today :-)

Below is the help, I used default settings for all non obligatory options (default: first multiple bin is not a very clear value to me !)

If you tell me GenMap does something similar to gem-mappability, I trust you on that; this is accessory analysis and is of no vital importance for my work.

Thanks for your nice support and feedback

gem-mappability
Welcome to GEM-mappability build 1.315 (beta) - (2013/03/29 02:59:40 GMT)
 (c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com>
 (c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com>
 (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
*          check for updates at <http://gemlibrary.sourceforge.net>.   *
************************************************************************
Fatal error (gem-mappability.c:301,main)
 Mandatory option(s) missing: -I, -o, -l
Usage:
 gem-mappability
  -I <index_prefix>                       (mandatory)
  -C|--emulate-complement                 (for indices lacking it)
  --granularity <number>                  (default=10000)
  -o <output_prefix>                      (mandatory)
  --output-line-width <number>            (default=70)
  -l <read_length>                        (mandatory)
  -t|--approximation-threshold <number>|'disable'
                                          (default: first multiple bin)
  --mismatch-alphabet <symbols>           (default="ACGT")
  -m <max_mismatches>|<%_mismatches>      (default=0.04)
  -e <max_edit_distance>|<%_differences>  (default=0.04)
  --max-big-indel-length <number>         (default=15)
  --min-matched-bases <number>|<%>        (default=0.80)
  -s|--strata-after-best <number>         (default=1)
  -T <thread_number>                      (default=1)
  --show-license                          (print license)
  -h|--help                               (print usage)

cpockrandt commented 4 years ago

You can disable the approximation with -t 'disable'.

Roughly speaking: if the algorithm chooses -t 7 (such as for the human genome), it means that if it searches a k-mer that occurs more than 7 times, the algorithm will locate where this k-mer occurs in the genome and set the same mappability value at those positions (and not search the k-mer later again - actually it can happen but we skip this case here for simplicity). This strategy will give you the same results as GenMap does (if both are run with 0 errors), but for mismatches/indels these other k-mer locations could actually have lower or higher mappability values than gem reports. But turning this approximation off slows gem down significantly. GenMap does not use this approximation and will give you the exact mappability value at each position.

Here's an example. Let's assume you compute the mappability for 3-mers with up to 1 mismatch. The first position has a mappability of 0.333 (ACG occurs 3 times with up to 1 mismatch). gem's approximation could set the mappability at the locations of ACC and AGG also to 0.333 (if the threshold parameter is -t 3 or lower), but actually ACC and AGG occur each only twice with up to one mismatch, hence their mappability value should only be 0.5.

ACG .... ACC ... AGG ...

I have use gem-mappability following an old report by Derrien et al and I documented my commands on our wiki few years ago (I added a link to your page today :-)

Thanks! :-)

Edit: Unfortunately I don't remember anymore what they meant with first multiple bin. Edit2: Since they do not give you exact mappability/frequency values (but also bin them such as occurs between 10-15 times, the first multiple bin is the first bin that represents more than one mappability/frequency value). It's more clear if you look at Fig. 2 at the plos paper, where the first bin that represents more than one frequency value is (actually) [8-9], the ones before are [1], [2], ..., [7]

cpockrandt / genmap

can genmap map run with very large values #19