Psy-Fer / SquiggleKit

SquiggleKit: A toolkit for manipulating nanopore signal data
MIT License

Fast5_fetcher hanging on 'extracting' stage #27

Closed dn-ra closed 4 years ago

dn-ra commented 4 years ago

Hello!

Excited to use this tool, but I'm hitting an issue. Whether I use fast5_fetcher or fast5_fetcher_multi, the program starts up and begins extracting, then stalls there. No errors are reported, but no matter how much time has elapsed, the fast5 output is always 304 KB and contains only one file. Screenshot of the command here: [image]

Any idea what's going on?

Thanks! Dan

Psy-Fer commented 4 years ago

Hey Dan,

I have a feeling it's going to be my crappy method of column extraction. For a quick fix, can you run

head sequencing_summary.txt

and give me the output? Then I can tell you how to update it.

I'll be updating SquiggleKit soon to do this better. But this might get you started without having to wait.

If there is anything else I can help with, let me know.

James

dn-ra commented 4 years ago

Ah I saw you write that on another issue but didn't realise it would be related.

Here's the output. head_seqsum.txt

Dan

Psy-Fer commented 4 years ago

Hello,

Hmm, okay, that's the Albacore seq_sum. That should be fine. Can you tell me more about how the files are stored, and include a head of the index file?

Thanks.

dn-ra commented 4 years ago

index_head.txt

The fast5s are stored in a tar file. Fastqs are stored in an uncompressed file dna_pass.fastq. Header here: head_fastq.txt

Let me know if you need anything more. Dan

dn-ra commented 4 years ago

Would it tell me if it couldn't find a fast5 matching a particular fastq read? Or would it hang like this? Just thinking out loud.

Psy-Fer commented 4 years ago

So it's one big tar file? Try adding -z to the command. This should output just the tar extraction instructions, rather than doing the actual extractions.

Should help figure out what is going on. Let me know if that works properly.

Thanks for bearing with me.

dn-ra commented 4 years ago

That's odd. It goes straight to saying that it's done. [image]

When I run those same arguments with fast5_fetcher_multi.py it actually doesn't produce any output at all: [image]

EDIT: Sorry, realised it outputs the commands to tater_master.txt. It contains just one line:

tater_gn090_rna.tar.txt /home/daniel/TEST_squiggle/gn090_rna.tar

Psy-Fer commented 4 years ago

And does the tater_gn090_rna.tar.txt file have information in it?

The -z output is essentially a hacky way of doing parallel extraction on a high-performance cluster.

I think the reason it takes so long is that it is one single big tar file, and the I/O for reading and extracting from it might just be taking ages.

Also, which operating system are you on? If you look at your index file, what happens when you do the following:

tar -xf /home/daniel/TEST_squiggle/gn090_rna.tar --transform='s/.*\///' -C output_folder/ workspace/fail/0/imb17_013486_20171109_FAE31833_MN17279_mux_scan_20171109_RNAseq_GN90_30237_read_100_ch_210_strand.fast5

Does it extract the file? The -C flag should point to the destination folder you want the file written to.
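For anyone who would rather run the same check from Python, here is a minimal sketch using the standard-library tarfile module. It roughly mirrors the tar command above: it pulls the named members out of the archive and drops their directory parts, as the --transform expression does. The function name extract_flat and its arguments are made up for illustration and are not part of SquiggleKit. Because a plain tar archive has no index, the sketch makes one sequential pass over it and picks up every wanted member along the way.

```python
import os
import shutil
import tarfile


def extract_flat(tar_path, member_names, dest_dir):
    """Extract the named members into dest_dir without their directory parts.

    Hypothetical helper, not part of SquiggleKit.
    Returns the member names that were found and written.
    """
    wanted = set(member_names)
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r") as tar:
        # Single sequential pass: every member header is visited once.
        for member in tar:
            if member.name not in wanted or not member.isfile():
                continue
            # Keep only the file name, like --transform='s/.*\///'
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(member.name)
    return written
```

Any wanted name missing from the returned list was not found in the archive, which may help tell a slow extraction apart from a path mismatch.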

dn-ra commented 4 years ago

Yep, tater_gn090_rna.tar.txt has 21669 lines in it. Head file: tater_head.txt

And yeah, it extracts it. But took quite a while to do just the one so you might be right. What's the solution to this?

OS is NAME="Ubuntu" VERSION="18.04.3 LTS (Bionic Beaver)"

Psy-Fer commented 4 years ago

Oh nice, Ubuntu 18, same as I'm running at the moment. Everything works just that BIT better than on 16.04.

Yeah, it's just an issue with the way it was built: it expects multiple tar files, each with a few thousand fast5 files.

One solution is to take the index file and cut off the first few lines, down to the first fast5 file line. Split that file into groups of 4000 lines (or any number you want), so you have multiple txt files. Then you can do something similar to what my batch_tater.py script does: loop over the txt files, extract each batch, and combine the extracted files, either into another tar or into a multi-fast5 file with the ont_fast5_api.

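Something like:

mkdir -p tar_files output multi
split -l 4000 --additional-suffix=.txt tater_gn090_rna.tar.txt tar_files/batch_

for batch in tar_files/*.txt; do
    tar -xf /home/daniel/TEST_squiggle/gn090_rna.tar --transform='s/.*\///' -C output/ -T "$batch"
    # name the multi-fast5 files after the batch so later batches don't overwrite earlier ones
    single_to_multi_fast5 --input_path output/ --save_path multi/ --filename_base "$(basename "$batch" .txt)"
    rm -f output/*.fast5  # start the next batch from an empty folder
done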

Not sure what your bash is like, but I hope that makes sense (I didn't test this, and wrote it on my phone, so just double check :) )
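If bash is not your thing, the split step can also be sketched in Python. This is only an illustration, assuming the list file has one archive member path per line. The function name split_into_batches, the batch_NNNN.txt naming, and the folder layout are invented here and are not taken from batch_tater.py.

```python
from pathlib import Path


def split_into_batches(list_path, out_dir, batch_size=4000):
    """Write the non-empty lines of list_path into numbered batch files.

    Hypothetical helper, not part of SquiggleKit.
    Returns the batch file paths in order.
    """
    lines = [ln for ln in Path(list_path).read_text().splitlines() if ln.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    batch_paths = []
    for start in range(0, len(lines), batch_size):
        # batch_0000.txt, batch_0001.txt, ... each with up to batch_size lines
        batch_path = out / f"batch_{start // batch_size:04d}.txt"
        batch_path.write_text("\n".join(lines[start:start + batch_size]) + "\n")
        batch_paths.append(batch_path)
    return batch_paths
```

With the 21669-line list from this thread and the default batch size, this would give six files: five of 4000 lines and one of 1669. Each file can then be passed to tar with -T, as in the loop above.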

Then once you have the multi_fast5 files, you can use either ont_fast5_api for selective extraction (they added it after they saw fast5_fetcher), or use fast5_fetcher_multi.py

Let me know how you get along.

James

dn-ra commented 4 years ago

Hey James,

Sorry I was unresponsive for a while. I've been out with the flu.

I'm a little confused what this is doing. This is what I'm understanding: Take the tater_batch file, split it into smaller chunks, then for each of those chunks untar my big fast5 file and extract the relevant fast5s into a multi_fast5 file. Why do I have to do the fetching again after that? Hasn't it already extracted the fast5s that I've pointed it to with my .paf files?

Thanks a million for all this, Dan

Psy-Fer commented 4 years ago

Ahh, yes and no.

If you select everything initially to make the tater file, you can use it to stream the data into multi-fast5 files without hitting file limits on HPC environments (ours has a 1M file limit, for example).

Otherwise, yes, you are right, you will have all the files you need in multi-fast5 format, and can then use these, or other tools, however you like.

Hope you are feeling better. James

dn-ra commented 4 years ago

I have it working now!

I was getting some other errors from the ont_fast5_api single_to_multi_fast5 script so I just output them all as single fast5 files to an output.

I've been looking at plotting the squiggles in line with the basecalled data. It's my impression that SquiggleKit doesn't have this functionality and is more for handling the fast5 files themselves before handing them over to other tools like Tombo (aside from motifseq and segmenter). Is this right?

Dan

Psy-Fer commented 4 years ago

Squiggle alignment is something Tombo and nanopolish/f5c do with re-squiggle and event-align respectively. They already do this pretty well, so I wasn't going to do it again. I built SquiggleKit so you could have a look at the data and get comfortable with how the signal looks, and to provide some examples of how to do simple analysis on that data.

I've mostly used the toolkit myself for checking strange anomalies in various experiments; the authors of this paper (https://www.biorxiv.org/content/10.1101/852665v1) use it for the same thing.

Or I've used it to, say, split all the fast5 files into batches per cell for our single-cell sequencing methods (https://www.nature.com/articles/s41467-019-11049-4),

or for strongly labelling data for use in deep learning (https://github.com/Psy-Fer/deeplexicon).

And MotifSeq was originally built years ago for another purpose, which isn't released yet.

I hope that gives some better background into my motivations for the toolkit.

dn-ra commented 4 years ago

Yes I understand it all a lot better now. Thanks for all your help. Dan

Psy-Fer commented 4 years ago

You are welcome :)

Cheers, James