COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

kquery, missing command #28

Closed chklopp closed 2 years ago

chklopp commented 2 years ago

Hi,

I've indexed a fasta file with pufferfish and it worked. Then I wanted to find the locations of a set of k-mers with pufferfish kquery, but I got:

There is no command "kquery"
The valid commands to pufferfish are : {align, examine, index, kquery, lookup, stat, validate}

Any idea what went wrong?

rob-p commented 2 years ago

Hi @chklopp,

Thanks for the report. As you can see from the commits, the kquery command is brand new, so it's possible there are still some interface bugs. Can you share the full command line you are attempting to execute that is giving the output above?

Thanks, Rob

chklopp commented 2 years ago

Hi @rob-p

If I only enter

pufferfish kquery 

I get

There is no command "kquery"
The valid commands to pufferfish are : {align, examine, index, kquery, lookup, stat, validate}

My full command line is :

pufferfish kquery -I Fa_pufferfish -o puffer_out -q Fa.genome.new.fasta --threads 4

And it generates the same result.

rob-p commented 2 years ago

Hi @chklopp,

Ah, yes; this is confusing and has to do with the way that the help message processing is handled when the full command line isn't matched. I will fix this (after I run off to teach a class). In the meantime, please try the following command line:

pufferfish kquery -i Fa_pufferfish -o puffer_out -q Fa.genome.new.fasta --threads 4

Note, it is exactly the same as the command you had before, except that -i must be lowercase rather than uppercase. Also, I fixed a small bug yesterday afternoon that would cause abnormal termination (after processing the input), so please be sure you've pulled the latest commit.

Best, Rob

chklopp commented 2 years ago

I get the same

There is no command "kquery"
The valid commands to pufferfish are : {align, examine, index, kquery, lookup, stat, validate}

pufferfish -v
version 1.0.0

rob-p commented 2 years ago

Hi @chklopp,

Ok, one last thing to try while I dig into the code — I missed this my first time looking at your command. The kquery command also does not take an output parameter (-o); the results are written to stdout. Can you please try:

pufferfish kquery -i Fa_pufferfish -q Fa.genome.new.fasta --threads 4 > puffer_out 

Also, just to note, I am not sure of the format of Fa.genome.new.fasta, but the query file should be a fasta file with 1 k-mer per line rather than longer strings. That is, kquery currently doesn't k-mer-ize the input strings. I can certainly look into that functionality if you want it, but right now the queries must each be 1 k-mer long. I believe what will happen currently is that the executable will just query the first k-mer of each record.
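Since kquery doesn't k-mer-ize the input strings, a query file can be prepared by windowing each record into its constituent k-mers beforehand. Here is a minimal sketch in Python (not part of pufferfish; the function name `kmerize_fasta` is made up for illustration):

```python
# Sketch (not part of pufferfish): split each fasta record into its
# constituent k-mers, so each query is exactly one k-mer long, as
# kquery currently expects.
def kmerize_fasta(records, k=31):
    """records: iterable of (name, sequence) pairs.
    Yields one (name, kmer) pair per k-mer window."""
    for name, seq in records:
        for i in range(len(seq) - k + 1):
            yield (f"{name}_{i}", seq[i:i + k])

# A 6 bp record yields 6 - 4 + 1 = 3 k-mers for k = 4:
kmers = list(kmerize_fasta([("r1", "ACGTAC")], k=4))
# kmers == [("r1_0", "ACGT"), ("r1_1", "CGTA"), ("r1_2", "GTAC")]
```

Writing each `(name, kmer)` pair back out as a fasta record then gives a file in the one-k-mer-per-record form described above.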

--Rob

rob-p commented 2 years ago

Does this modification work for you, @chklopp?

chklopp commented 2 years ago

Yes, it somehow worked until...

pufferfish kquery -i Fa_pufferfish -q kmers.dump.fasta --threads 4 > puffer_out

Index type = dense
-----------------------------------------
| Loading contig table | Time = 46.803 s
-----------------------------------------
size = 46437557
-----------------------------------------
| Loading contig offsets | Time = 140.28 ms
-----------------------------------------
-----------------------------------------
| Loading reference lengths | Time = 19.845 us
-----------------------------------------
-----------------------------------------
| Loading mphf table | Time = 388.58 ms
-----------------------------------------
size = 2224519276
Number of ones: 46437556
Number of ones per inventory item: 512
Inventory entries filled: 90699
-----------------------------------------
| Loading contig boundaries | Time = 5.301 s
-----------------------------------------
size = 2224519276
-----------------------------------------
| Loading sequence | Time = 415.41 ms
-----------------------------------------
size = 831392596
-----------------------------------------
| Loading positions | Time = 2.4349 s
-----------------------------------------
size = 2275944177
-----------------------------------------
| Loading reference sequence | Time = 415.96 ms
-----------------------------------------
-----------------------------------------
| Loading reference accumulative lengths | Time = 18.439 us
-----------------------------------------
terminate called without an active exception

It does not seem to be related to SLURM memory consumption (no oom kill). Any idea what went wrong?

rob-p commented 2 years ago

Yes, this was the thread join issue I mentioned above. If you build from the latest commit, the error should go away.

chklopp commented 2 years ago

With the latest version I get a segmentation fault error at the same point

pufferfish-b9fb67b/build/src/pufferfish kquery -i F252_pufferfish -q d4260e1d07c24dd88bea.double.dump.fasta --threads 4 > puffer_out
Index type = dense
-----------------------------------------
| Loading contig table | Time = 81.871 s
-----------------------------------------
size = 46437557
-----------------------------------------
| Loading contig offsets | Time = 134.6 ms
-----------------------------------------
-----------------------------------------
| Loading reference lengths | Time = 16.801 us
-----------------------------------------
-----------------------------------------
| Loading mphf table | Time = 354.38 ms
-----------------------------------------
size = 2224519276
Number of ones: 46437556
Number of ones per inventory item: 512
Inventory entries filled: 90699
-----------------------------------------
| Loading contig boundaries | Time = 6.2407 s
-----------------------------------------
size = 2224519276
-----------------------------------------
| Loading sequence | Time = 392.77 ms
-----------------------------------------
size = 831392596
-----------------------------------------
| Loading positions | Time = 2.2985 s
-----------------------------------------
size = 2275944177
-----------------------------------------
| Loading reference sequence | Time = 384.01 ms
-----------------------------------------
-----------------------------------------
| Loading reference accumulative lengths | Time = 19.21 us
-----------------------------------------
Segmentation fault

rob-p commented 2 years ago

Can you share the reference and query that produce this error? I've been testing so far on the human genome and random k-mers taken from the human genome. If I have access to the actual sequence that is causing the error on your end, I should be able to debug what is going on much more easily.

Thanks, Rob

chklopp commented 2 years ago

You can use https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/167/145/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna.gz as reference and http://genoweb.toulouse.inra.fr/~klopp/tmp/d4260e1d07c24dd88bea.double.dump.1000.fasta as query.

I've reduced my input file to 1K and 1M kmers and it works, but the output is weird:

@HD VN:1.0  SO:unknown
@SQ SN:1    LN:321461969    DS:T
@SQ SN:2    LN:250306446    DS:T
@SQ SN:3    LN:248000818    DS:T
@SQ SN:4    LN:263451608    DS:T
@SQ SN:5    LN:231459016    DS:T
@SQ SN:6    LN:177665613    DS:T
@SQ SN:7    LN:188892197    DS:T
@SQ SN:8    LN:185106115    DS:T
@SQ SN:9    LN:173365549    DS:T
@SQ SN:10   LN:158230103    DS:T
@SQ SN:10000001 LN:78004743 DS:T
@PG ID:pufferfish   PN:pufferfish   VN:1.0.0
1   0
2   0
3   0
4   0
5   0
6   0
7   0
8   0
9   0
10  0
11  0
12  0
13  0
14  0
15  0
16  0

What does it mean? The first lines look like a SAM header, but the rest...

rob-p commented 2 years ago

Right, so the output is compressed. The header is a SAM header, and the entries are of the format:

<query name:q>\t<num occ:n>
<occ 1 ref>\t<occ 1 position>\t<occ 1 orientation>
<occ 2 ref>\t<occ 2 position>\t<occ 2 orientation>
...
<occ n ref>\t<occ n position>\t<occ n orientation>

So 1 0 means that query 1 appears 0 times in the reference.
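For post-processing, the format Rob describes can be consumed with a short script. This is a sketch in Python (not part of pufferfish; the function name is illustrative), assuming tab-separated fields and SAM-style `@` header lines:

```python
# Sketch: parse kquery's compressed output — a "<query>\t<count>"
# line followed by <count> occurrence lines of
# "<ref>\t<position>\t<orientation>". '@' header lines are skipped.
def parse_kquery(lines):
    """Yields (query_name, [(ref, pos, orientation), ...]) tuples."""
    it = iter(l.rstrip("\n") for l in lines if not l.startswith("@"))
    for line in it:
        name, n = line.split("\t")
        occs = []
        for _ in range(int(n)):
            ref, pos, orient = next(it).split("\t")
            occs.append((ref, int(pos), orient))
        yield (name, occs)

records = list(parse_kquery([
    "@HD\tVN:1.0\tSO:unknown",
    "q1\t0",
    "q2\t2",
    "1\t100\tf",
    "2\t250\tr",
]))
# records == [("q1", []), ("q2", [("1", 100, "f"), ("2", 250, "r")])]
```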

rob-p commented 2 years ago

Hi @chklopp,

So, I ran on the example you provided, and I didn't get a segfault. Did this particular dataset cause one for you? Here is the query result file that my run generated.

--Rob

query_out.txt

chklopp commented 2 years ago

Hi @rob-p,

It ran also for me with 1K and 1M, but generated a segmentation fault with 10M kmers. These kmers have all been extracted from the same genome, so they should all be found.

After checking my command lines I found that pufferfish used a k-mer length of 31 for the index (the default value), while the k-mers I seek are 21 bp long. When I reindexed the genome with -k 21 it worked for 1K but generated a segmentation fault for 1M.
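A mismatch like this (an index built with the default k = 31, queried with 21 bp k-mers) can be caught before querying. A minimal sketch (not part of pufferfish; assumes the query sequences are already loaded as plain strings):

```python
# Sketch: flag query sequences whose length differs from the k-mer
# length the index was built with, before running kquery.
def check_query_k(seqs, index_k):
    """Return the set of sequence lengths that differ from index_k."""
    return {len(s) for s in seqs} - {index_k}

# 21 bp queries against a default (k = 31) index are flagged:
bad = check_query_k(["ACGT" * 5 + "A"], index_k=31)
# bad == {21}
```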

rob-p commented 2 years ago

Hi @chklopp,

Ok, I will try to extract 1M k-mers and see if I can query them; or if you could share your 1M query file, that'd be great too.

chklopp commented 2 years ago

http://genoweb.toulouse.inra.fr/~klopp/tmp/d4260e1d07c24dd88bea.double.dump.1M.fasta.gz

chklopp commented 2 years ago

It works with 100K.

rob-p commented 2 years ago

Hi @chklopp,

Thanks! I'm able to reproduce this and am looking into it. I'll report back here when I have more info.

--Rob

rob-p commented 2 years ago

Hi @chklopp,

Ok — I have fixed this bug in the latest commit. It was a lifetime-related use-after-free. I've gotten too used to working in the Rust language, where the compiler would trivially catch these kinds of silly errors. Sorry for all the trouble. Please let me know if it works for you now.

Thanks, Rob

P.S. I will note that not all of the k-mers were found in the reference. Only 845,220 of the k-mers are matched in the reference. I spot checked some of the ones that were missed, and grep also fails to find them (or their reverse complement). So, I suspect not all of the keys / k-mers here actually come from the reference.

rob-p commented 2 years ago

Hi @chklopp,

I'll assume that this has resolved the issue, but feel free to ping back here or re-open this issue if it has not.

Best, Rob

chklopp commented 2 years ago

Hi @rob-p,

It works! But the output file is 1.2 TB large for a 1.6 GB input file. I had a look at the output file and it contains the same lines many times.

When I compare the old version and the new one on 1M kmers :

wc puffer_out_1M_b9fb67b puffer_out-1M_0c2b0a2
   230634    614327   2727820 puffer_out_1M_b9fb67b
 41528896 110749020 493976276 puffer_out-1M_0c2b0a2

grep -v '^@' puffer_out-1M_0c2b0a2 | sort -k1,1n | more
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f
0   100000438   f

rob-p commented 2 years ago

Hi @chklopp,

Can you add some context to the grep (e.g. -C)? Is this redundant output for the same input k-mer, or the result of repeated input? The first is a bug; the second can happen with repetitive or low-complexity input. I could implement not writing output for a query occurring more than some threshold number of times. Also, since the output is written to stdout, it can be piped directly into gzip for compression.

--Rob

chklopp commented 2 years ago

You mean grep -v? It is there to remove the header from the SAM format. I know the number of times each k-mer I seek should appear in the genome (twice), and these kmers are unique in the input fasta file:

grep -v '^>' 2f71f65871d3496eabc9.double.dump.fasta  | wc
18924502 18924502 416339044
grep -v '^>' 2f71f65871d3496eabc9.double.dump.fasta  | sort -u | wc
18924502 18924502 416339044

Therefore I expect only 3 lines per input k-mer in the output file (the count line plus two occurrence lines), and one location should only appear once. The output is redundant, but I do not know why.

rob-p commented 2 years ago

Sorry, I mean to use -C <some integer> in the grep command to see what record ID is associated with these redundant matches. Is this behavior present in the 1M read input you shared earlier? If so, I could take a look there.

chklopp commented 2 years ago

Yes, it is the case with this file. NB: the genome I use is different from the public one I pointed to earlier.

grep -n 100000438 puffer_out-1M_0c2b0a2 | head
1498768:0   100000438   f
2338844:0   100000438   f
2728884:0   100000438   f
4019057:0   100000438   f
5369327:0   100000438   f
6869606:0   100000438   f
8189829:0   100000438   f
10500215:0  100000438   f
12630576:0  100000438   f
14820954:0  100000438   f
rob-p commented 2 years ago

Great, I will take a look and report back with what I find.

rob-p commented 2 years ago

Ok, @chklopp, the issue seemed to be that the clear() method didn't actually clear the stringstream: clear() only resets the stream's error flags, and str("") is needed to empty the contents (sometimes the C++ method behavior is just impenetrable). I've pushed a commit that should fix the redundant output issue. Please let me know if it works for you.

Best, Rob

chklopp commented 2 years ago

This time all looks fine. Bravo, nice job, it is very efficient.