Psy-Fer / SquiggleKit

SquiggleKit: A toolkit for manipulating nanopore signal data
MIT License
122 stars 23 forks source link

Error when using fast5_fetcher.py #19

Closed pterzian closed 4 years ago

pterzian commented 5 years ago

Hi there,

I am trying to use fast5_fetcher and I have some questions about its usage. My goal is to regroup the fast5s by chromosome after having mapped the fastqs on a whole genome reference. To do so I :

This is the line I'm running :

python SquiggleKit/fast5_fetcher.py -p aln/perChrm/chrm.1.paf \
    -s Sequencing_summaries/seq_summary_allrun.txt.gz \
    -i index_fast5/all.name.index.gz \
    -o fast5/fast5_chrm1/

And this is the error I get :

Traceback (most recent call last):
  File "SquiggleKit/fast5_fetcher.py", line 481, in <module>
    main()
  File "SquiggleKit/fast5_fetcher.py", line 117, in main
    print >> sys.stderr, "Starting things up!"
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

So I wonder if the structure of my fast5 files directories is falling in your Raw structure case. These are coming straight out the machine so they are multi-read files. Do I have to demultiplex them (with single_to_multi_fast5) before building the index ? Also, would you advise to store fast5 files in tar archives and only extract reads when needed ?

Thanks for your help, Paul.

Psy-Fer commented 5 years ago

Hey paul,

First off, thanks for trying out SquiggleKit.

That error is a python3 specific error to do with the stderr print method being used, which is only valid in python 2. typing python --version should be able to figure that one out.

However, with multifast5 files, please use fast5_fetcher_multi.py You then don't need an index and can run it like so.

python3 fast5_fetcher_multi.py -p aln/perChrm/chrm.1.paf \
-s Sequencing_summaries/seq_summary_allrun.txt.gz \
-m fast5/top/directory/ \
-o fast5/fast5_chrm1/ 

There is one caveat, the sequencing summary file may be in a format not compatible with the tool, as it keeps changing and I have yet to put in dynamic column finding.

If it runs, but doesn't give you any data, please send me the first row of the sequencing_summary file.

Also please note, that this will currently produce a folder with as many single files as you have reads in the paf. I've been working on packaging them back up again, but this has yet to be released.

Let me know if this works out and if I can provide any further help.

Regards, James

pterzian commented 5 years ago

Hey James,

Thanks for the quick answer, I don't know how I missed that fast5_fetcher_multi.py, but I did. I tried it and it failed when recovering filenames list, from what I see in your code I think it might come from the sequencing summary file as you said. The error :

Multi-fast5 mode detected in mode: ms
Checks passed!
Starting things up!
Getting multi-fast5 info...
no filenames list built, check inputs
Extracting reads from multi-fast5 files...
No file paths built
done!

Script end.

As you requested here is the header of my summary files :

filename_fastq  filename_fast5  read_id run_id  channel mux start_time  duration    num_events  passes_filtering    template_start  num_events_template template_duration   sequence_length_template    mean_qscore_template    strand_score_template   median_template mad_template

I also wonder if the structure of my fast5 files is alright because what I gave to the --multi_f5 parameter is the path to a directory that contains 3 symbolic links to 3 different run. Thought it would be worth mentioning it.

best, Paul

Psy-Fer commented 5 years ago

Hey Paul,

Ahh yes as I expected.

If you like, I can upload a version that will work for you, or you can simply change the following in the fast5_fetcher_multi.py file and be on your way

There is a function called: get_filenames_multi_f5() around line 477

at the bottom there is the following

            else:
                if line[1] in ids:
                    if line[0] not in files:
                        files[line[0]] = []
                    files[line[0]].append(line[1])

if you change it to

            else:
                if line[2] in ids:
                    if line[1] not in files:
                        files[line[1]] = []
                    files[line[1]].append(line[2])

This will make it work.

as for the symlinks, i'm not sure actually, something I didn't check with the directory traversal. One way to find out though. Give it a try and let me know how it goes.

Again, if you would like me to upload a version of the code that works for you instead of editing it, (i know it's a bit hacky at the moment), just let me know. Happy to.

Regards James

pterzian commented 5 years ago

Hi James,

So I made the modifications to the function and it failed again when trying to extract reads from multi-fast5 files, generating the same No file paths built. As I thought it is caused by my fast5s top directory that contains only symlinks. I just had to change os.walk(path) to os.walk(path, followlinks=True) and now it seems like it is working :) ! I could propose a pull resquest if you would like to. As well for the dynamic column finding, we could play with the header.

I guess I have one last question, the error file reports around 15% of the reads to be missing (Failed to find read 'xxx' in xxx.fast5). Would you have an explanation for this ?

best, Paul

Psy-Fer commented 5 years ago

Hey Paul,

Ahh yes. I've had this same issue myself recently. An unfortunate thing with nanopore naming scheme, is it names the files in the pass/fail folders the same thing. The way i'm currently storing them, using the file name, not the path, as the unique identifier, makes this fail when it is linked to the wrong file.

There are a few solutions to this, like using the TRUE/FALSE call in the passes filtering field to use the pass/fail folders. However this then relies on folder structure being maintained, which is usually more dubious than file structure (though maybe not in the case of nanopore).

I made a feature request to make the names different just the other day.

Thanks for figuring out the symlink option.

The dynamic column finding has been on my list for a while. I was just going to take the column headers, enumerate them, and then match them to a format/version i've been keeping track of here (not complete) : https://github.com/Psy-Fer/nanopore_formats

Regards, James

pterzian commented 4 years ago

Hey James, sorry for the delay I had not much time to work with fast5_fetcher those past weeks. So actually the error Failed to find read 'xxx' in xxx.fast5 was my fault, I had given a paf file made on all fast5 and then fetch in the pass folders so all reads from the fail folder were missing obviously. So it is not really an issue.

However I have a new question :p. I thought fast5_fetcher was working with BAM files as written in the graphics presenting the tool on SquiggleKit's main page. Is this something that is still on the to do list ? I see in the documentation the script is works only with paf/fastq/txt files ?

best, Paul

Psy-Fer commented 4 years ago

Hey Paul,

So it doesn't work with bam files just yet. Though sam/bam is in the works. The trick to do it now, get the bam file you want, then do something like

samtools view filtered.bam | cut -f1 > readIDs_flat.txt

Where it just cuts out the read ID and puts it in a flat file.

I think I will modify the "flat" method to just take any file type that has the first column which contains a readID, as this is pretty simple to figure out for paf, sam, bam, etc given extension

harisankarsadasivan commented 4 years ago

@Psy-Fer @pterzian , So I am using Python3 and facing the issue where no FAST5's are dumped into the output folder. Should I take the flat file and provide it as input to the fetcher script instead of the paf file?

noncodo commented 4 years ago

I get the same error as @pterzian, but using python 3.6.3, specifically:

python3 ~/apps/SquiggleKit/fast5_fetcher.py -f readsToPull.txt -s Sequencing_summary.txt -i name.index.gz -o ./fast5_pulled
Traceback (most recent call last):
  File "/home/user/apps/SquiggleKit/fast5_fetcher.py", line 481, in <module>
    main()
  File "/home/user/apps/SquiggleKit/fast5_fetcher.py", line 117, in main
    print >> sys.stderr, "Starting things up!"
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?
noncodo commented 4 years ago

Ah, it works using python2...

Psy-Fer commented 4 years ago

Hello @noncodo

Only fast5_fetcher_muti.py has python3 support.