Multiple reads in one fast5

AdrienJarretier commented 5 years ago

Hi, me again,

I am trying to analyse new data from a recent sequencing run with the MinION.

The trouble is that the output format has changed a bit: instead of one file per read, MinKNOW now writes multiple reads into one file, like so: [screenshot: HDFView showing multiple reads in one .fast5 file]

That's a good thing, since the old format produced far too many small files and was difficult for the filesystem to handle.

However, NanoTableM cannot read the new format. I think the culprit is the rhdf5 call here:

https://github.com/davidebolo1993/NanoR/blob/master/Scripts/NanoTableM.R#L49

I'm looking into it; maybe there is a simple way to handle this with rhdf5, but if you have any input I'd be happy to hear it.

Here is a link to download the file from the screenshot above : https://we.tl/t-bwTQfZaGOf

davidebolo1993 commented 5 years ago

Hi @AdrienJarretier,

No worries, I already have a new NanoR version with lots of improvements almost ready to share. At the moment NanoR is submitted to a scientific journal, and when I last submitted the revision I didn't have a test set for multi-read .fast5 files. A few days ago I managed to get one, and I developed a 2.0 version of NanoR that will be online shortly. The new versions of the scripts are attached; feel free to use them.

This is a possible workflow for plotting statistics:

DataPass<-'/path/to/fast5_pass'
DataFail<-'/path/to/fast5_fail'
Label<-'MultiRead'
DataOut<-'/path/to/dataout'
List<-NanoPrepareM(DataPass,DataFail, Label=Label, MultiRead=TRUE)
Table<-NanoTableM(List,DataOut,Cores=6,GCC=TRUE) #or FALSE
NanoStatsM(List,Table,DataOut)

You can also extract .fastq files with a user-defined quality threshold using NanoFastqM(). Some other options were added as well, but I'll discuss them in the new README, which will be published after I receive the revision.

NanoTableM.txt NanoStatsM.txt NanoPrepareM.txt NanoFastqM.txt

AdrienJarretier commented 5 years ago

Amazing, I didn't expect that much that fast.

I will test that right away, thank you very much.

davidebolo1993 commented 5 years ago

Let me know if it works as expected. It works for my data set but it is always good to have a second opinion !

AdrienJarretier commented 5 years ago

So no, it does not work yet, but it's on the right track. What I did:

Replaced the 4 scripts in the NanoR.tar.gz archive with the 4 new ones
Ran install.packages('NanoR.tar.gz', repos=NULL)
Ran the workflow you proposed on a directory containing the single file I linked before : https://we.tl/t-bwTQfZaGOf

The file seems to be read correctly, since the table is not empty, but NanoStatsM then fails with an error:

> Label<-'MultiRead'
> 
> List<-NanoPrepareM(DATA_PASS_DIR,DataFail=NA, Label=Label, MultiRead=TRUE)
1 multiread .fast5 files specified as passed
No failed .fast5 files path specified
No skipped .fast5 files path specified
Done
> 
> Table<-NanoTableM(List,DATA_OUT_DIR,Cores=4,GCC=TRUE) #or FALSE
Extracting metadata and calculating GC content from multi-read .fast5 files...
Done
> 
> head(Table)
     Read Id                                Channel Number Mux Number
[1,] "000d64fc-5091-44ef-959e-2dd3aeb059fe" "87"           "3"       
[2,] "0045408f-5155-4a5c-a383-9d76863eaa57" "253"          "1"       
[3,] "005d8a76-dbe7-45df-8adf-0627aa8636a3" "472"          "3"       
[4,] "005fd50c-39c8-4be7-a073-abebe213bfe6" "161"          "4"       
[5,] "0067c135-8ca6-4e51-81a7-ce7249bae342" "355"          "2"       
[6,] "007d49b2-7201-4394-afc7-84d1c486a323" "295"          "4"       
     Unix Time    Length of Read Quality  GC Content         
[1,] "1550590230" "712"          "Qscore" "0.401685393258427"
[2,] "1550590340" "939"          "Qscore" "0.39297124600639" 
[3,] "1550590223" "497"          "Qscore" "0.378269617706237"
[4,] "1550590260" "4719"         "Qscore" "0.339902521720704"
[5,] "1550590306" "4359"         "Qscore" "0.379215416379904"
[6,] "1550590244" "1913"         "Qscore" "0.569785676947203"
> 
> NanoStatsM(List,Table,DATA_OUT_DIR)
Error in seq.default(from = min(round(Relative_Time)), to = max(round(Relative_Time)),  : 
  'from' must be a finite number
Calls: NanoStatsM -> seq -> seq.default
In addition: Warning messages:
1: In which(as.numeric(NanoTable[, 6]) >= 7) : NAs introduced by coercion
2: In which(as.numeric(NanoTable[, 6]) < 7) : NAs introduced by coercion
3: In max(Time_2) : no non-missing arguments to max; returning -Inf
4: In min(Time_2) : no non-missing arguments to min; returning Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf
7: In min(round(Relative_Time)) :
  no non-missing arguments to min; returning Inf
8: In max(round(Relative_Time)) :
  no non-missing arguments to max; returning -Inf
Execution halted

davidebolo1993 commented 5 years ago

Well, this is not an issue. The problem is that you ran your analysis on a single multi-read .fast5 file and not on a complete run. NanoR is built to work on experiments that run for hours, not minutes (the default experimental run duration is 48 hours). Try using a complete set of multi-read .fast5 files and it will work. Otherwise, it can't rescale time correctly!
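As a rough pre-check before calling NanoStatsM, the time span covered by a table can be estimated from its "Unix Time" column. The following is only a sketch in base R: `run_duration_minutes` is a hypothetical helper, not a NanoR function, and the sample values are the six timestamps visible in the head(Table) output above, not the whole run.

```r
# Hypothetical helper (not part of NanoR): span between the earliest and
# latest read timestamps, in minutes.
run_duration_minutes <- function(unix_time) {
  t <- suppressWarnings(as.numeric(unix_time))  # table columns are character
  t <- t[is.finite(t)]                          # drop anything unparseable
  if (length(t) == 0) return(NA_real_)
  diff(range(t)) / 60
}

# The six "Unix Time" values shown by head(Table) earlier in this thread.
unix_time <- c("1550590230", "1550590340", "1550590223",
               "1550590260", "1550590306", "1550590244")

duration <- run_duration_minutes(unix_time)
print(duration)
```

On a real table this would presumably be called as run_duration_minutes(Table[, "Unix Time"]), assuming the column is named as printed above.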

Best

AdrienJarretier commented 5 years ago

Yes, I thought of that, so I ran it on the whole experiment too. It took 4 hours before it crashed with this same error. In 2 days I should have a brand new experiment to try it on again, though.

davidebolo1993 commented 5 years ago

I just re-tested the same function I sent you on the multi-read .fast5 files I have, and it works perfectly. It seems very strange to me that NanoStatsM took 4 hours to calculate statistics. Moreover, it seems that the quality score was not correctly extracted from your .fast5 files.
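The warnings in the traceback above are consistent with that: the Quality column holds the literal string "Qscore", which cannot be converted to a number. The sketch below reproduces the same chain of warnings and the final error in base R. It mimics only what the warning messages reveal (a threshold of 7 on column 6, then min/max, then seq); the variable names are illustrative and it is not NanoStatsM's actual code.

```r
# Quality and Unix Time as printed by head(Table) earlier in this thread.
quality   <- rep("Qscore", 6)
unix_time <- c(1550590230, 1550590340, 1550590223,
               1550590260, 1550590306, 1550590244)

# as.numeric() cannot parse "Qscore", so every value becomes NA (with a warning).
q_num <- suppressWarnings(as.numeric(quality))

# which() ignores NA comparisons, so both quality groups end up empty.
pass_idx <- which(q_num >= 7)
fail_idx <- which(q_num < 7)

# min()/max() of an empty vector return Inf/-Inf (with a warning).
time_pass <- unix_time[pass_idx]
lo <- suppressWarnings(min(time_pass))
hi <- suppressWarnings(max(time_pass))

# seq() rejects a non-finite 'from'; capture the message instead of stopping.
seq_error <- tryCatch(
  { seq(from = lo, to = hi, by = 1); NA_character_ },
  error = function(e) conditionMessage(e)
)
print(seq_error)
```

If this reading is right, the error would appear regardless of run length as long as the Quality column is not numeric.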

I’ll look into the sample you gave me tomorrow and let you know.

davidebolo1993 commented 5 years ago

Hi @AdrienJarretier,

I've looked into your test set. First of all, you are using multiple cores to analyze a single multi-read .fast5 file. This is probably not a problem, but it's better to have at least 4 multi-read .fast5 files if you use 4 cores. I also changed the code of NanoTableM and NanoStatsM a little; the new versions are attached below. The quality should now be extracted without problems (although I could not recreate the issue, I guess it was related to a variable sharing its name with another one). I can confirm, however, that the main problem is your run duration, which is only 4 minutes (too short). Let's keep in touch for the new experiment.

You can run them as suggested before.

NanoTableM.txt NanoStatsM.txt
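The advice about cores can be sketched as a small guard that caps the worker count at the number of .fast5 files found. This is an illustration in base R only: `pick_cores` is a hypothetical helper, not part of NanoR, and the demo directory with its empty placeholder file is created just for the example.

```r
# Hypothetical helper (not part of NanoR): never request more workers than
# there are .fast5 files or cores on the machine.
pick_cores <- function(fast5_dir, requested = 4) {
  files <- list.files(fast5_dir, pattern = "\\.fast5$", recursive = TRUE)
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L   # detectCores() may return NA
  max(1L, min(requested, length(files), available))
}

# Demo on a temporary directory holding one empty placeholder file.
demo_dir <- file.path(tempdir(), "fast5_pass_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "reads_0.fast5")))

cores <- pick_cores(demo_dir, requested = 4)
print(cores)
```

Following the workflow given earlier, the result could then be passed along as Table<-NanoTableM(List,DataOut,Cores=pick_cores(DataPass),GCC=TRUE).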

Closing the issue, for now

Best,

Davide

davidebolo1993 / NanoR

Multiple reads in one fast5 #6