run error - Githubissues

Johnsonzcode commented 4 years ago

I ran fmlrc with fowllowing code:

ONT=/storage-01/poultrylab1/zhaoqiangsen/nanopore/genome/fq_gz/YY_ONT.fq.gz
npy=./MSBWT/comp_msbwt.npy
fmlrc2 -t 12 -C 10 $npy $ONT YY_ONT_corrected.fq

It gave me :

[2020-10-09T11:23:14Z INFO  fmlrc2] Input parameters (required):
[2020-10-09T11:23:14Z INFO  fmlrc2]     BWT: "./MSBWT/comp_msbwt.npy"
[2020-10-09T11:23:14Z INFO  fmlrc2]     Input reads: "/storage-01/poultrylab1/zhaoqiangsen/nanopore/genome/fq_gz/YY_ONT.fq.gz"
[2020-10-09T11:23:14Z INFO  fmlrc2]     Output corrected reads: "YY_ONT_corrected.fq"
[2020-10-09T11:23:14Z INFO  fmlrc2] Execution Parameters:
[2020-10-09T11:23:14Z INFO  fmlrc2]     verbose: false
[2020-10-09T11:23:14Z INFO  fmlrc2]     threads: 1
[2020-10-09T11:23:14Z INFO  fmlrc2]     cache size: 10
[2020-10-09T11:23:14Z INFO  fmlrc2] Correction Parameters:
[2020-10-09T11:23:14Z INFO  fmlrc2]     reads to correct: [0, 18446744073709551615)
[2020-10-09T11:23:14Z INFO  fmlrc2]     k-mer sizes: [21, 59]
[2020-10-09T11:23:14Z INFO  fmlrc2]     abs. mininimum count: 5
[2020-10-09T11:23:14Z INFO  fmlrc2]     dyn. minimimum fraction: 0.1
[2020-10-09T11:23:14Z INFO  fmlrc2]     branching factor: 4
thread 'main' panicked at 'Could not read 7721032154 bytes of BWT body for file "./MSBWT/comp_msbwt.npy"', /storage-01/poultrylab1/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/fmlrc-0.1.1/src/bv_bwt.rs:163:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I also ran RUST_BACKTRACE=1 cargo run, it didn't work:

error: could not find `Cargo.toml` in `/storage-01/poultrylab1/zhaoqiangsen/GenomeAssembly/mallard/ONT` or any parent directory

please help me ~ Thank you in advance !!!!

holtjma commented 4 years ago

That's an interesting error to encounter, and it's from this block of code. This block is basically checking that the file described by the header matches the actual file length (i.e. to make sure you have all the data described by the header). My best guess is that something went wrong while creating your BWT file and it didn't finish writing/copying the data out.

How did you build the original BWT? And/or did you copy this BWT from somewhere and create a truncated file by accident? One thing you could potentially try is to load the file into python3's numpy (since it is supposed to match numpy specification). If it throws an error, there's a good chance your file is corrupted or incomplete for some reason.

As for the cargo run command, you would need to run that from inside the crate directory for it to work. This is why it's complaining about a missing Cargo.toml file.

Johnsonzcode commented 4 years ago

Thank you for quickly reply ! Yes I aslo check the code block, but I can't deal with it. The first time I creat my BWT files where BWT files were created one by one because of short reads are paired-ends. Code:

gunzip -c NDES00380_L4_1_clean.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./BWT/comp_msbwt.npy
gunzip -c NDES00380_L4_2_clean.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./BWT/comp_msbwt2.npy

It goes OK. But when I correct :

thread 'main' panicked at 'Could not read 7721032154 bytes of BWT body for file "./MSBWT/comp_msbwt.npy"', /storage-01/poultrylab1/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/fmlrc-0.1.1/src/bv_bwt.rs:163:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Then I guess something went wrong with BWT files, and merge the two BWT file:

#/usr/env python3
import numpy

npy1 = numpy.load('./comp_msbwt.npy')
npy2 = numpy.load('./comp_msbwt2.npy')
npy3 = numpy.append(npy1,npy2)
numpy.save('./msbwt.npy',npy3)

It still remain same error. Finally I rerun the create-BWT-process with merging short reads file into one file, and same eror happened:

### I want to speed up the process so use pigz but it seems pigz didn't work as multiple threads :
pigz -p 48 -d -c NDES00380_clean.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./MSBWT/msbwtIn1.npy
### the same code used as above section :
ONT=/storage-01/poultrylab1/zhaoqiangsen/nanopore/genome/fq_gz/YY_ONT.fq.gz
npy=./MSBWT/msbwtIn1.npy
fmlrc2 -t 60 $npy $ONT YY_ONT_corrected.fq
### and same error :
thread 'main' panicked at 'Could not read 15675140363 bytes of BWT body for file "./MSBWT/msbwtIn1.npy"', /storage-01/poultrylab1/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/fmlrc-0.1.1/src/bv_bwt.rs:163:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The only difference I found is the BWT file size changed.

Is still my create-BWT-file-process went wrong ?

Thank you for your time !

holtjma commented 4 years ago

You definitely cannot merge the BWT files as you tried with numpy, that will give you an array but it will be nonsensical, you should instead do something like:

gunzip -c NDES00380_L4_1_clean.fq.gz NDES00380_L4_2_clean.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./MSBWT/comp_msbwt.npy

I'm still not sure why your single-end BWT is getting an error. Have you tried running the example on the original FMLRC? You should be able to replace the commands near the end with the equivalent fmlrc2 commands.

It would also be helpful to see the full command output from your BWT creation AND the size of the resulting file. For example, here is what a small one looks like using a piece of the example:

example1 matt$ gunzip -c ERR022075_1.fastq.gz | head -n 1000 | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./BWT/comp_msbwt.npy
[2020-10-10T03:33:15Z INFO  fmlrc2_convert] Input parameters (required):
[2020-10-10T03:33:15Z INFO  fmlrc2_convert]     Input BWT: "stdin"
[2020-10-10T03:33:15Z INFO  fmlrc2_convert]     Output BWT: "./BWT/comp_msbwt.npy"
[M::mr_insert_multi] Turn off parallelization for this batch as too few strings are left.
[M::main_ropebwt2] inserted 25250 symbols in 0.005 sec, 0.003 CPU sec
[M::main_ropebwt2] constructed FM-index in 0.022 sec, 0.004 CPU sec
[M::main_ropebwt2] symbol counts: ($, A, C, G, T, N) = (250, 6300, 6187, 6161, 250, 6102)
[M::main] Version: r187
[M::main] CMD: ropebwt2 -LR
[M::main] Real time: 0.025 sec; CPU: 0.014 sec
[2020-10-10T03:33:15Z INFO  fmlrc::bwt_converter] Converted BWT with symbol counts: [250, 6300, 6187, 6161, 250, 6102]
[2020-10-10T03:33:15Z INFO  fmlrc::bwt_converter] RLE-BWT byte length: 18387
[2020-10-10T03:33:15Z INFO  fmlrc2_convert] RLE-BWT conversion complete.
example1 matt$ ls -l ./BWT/comp_msbwt.npy 
-rw-r--r--  1 matt  staff  18483 Oct  9 22:33 ./BWT/comp_msbwt.npy

Johnsonzcode commented 4 years ago

OK, I see. I will try your code below, if it doesn't work, I will try original FMLRC exmple.

gunzip -c NDES00380_L4_1_clean.fq.gz NDES00380_L4_2_clean.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert ./MSBWT/comp_msbwt.npy

Yes, I don't know why I merge pair-ends short reads into one file with cat them it still got an error the last time:

cat NDES00380_L4_1_clean.fq.gz NDES00380_L4_2_clean.fq.gz > NDES00380_clean.fq.gz

Full BWT creation output and its size:

[2020-10-09T14:57:55Z INFO  fmlrc2_convert] Input parameters (required):
[2020-10-09T14:57:55Z INFO  fmlrc2_convert]     Input BWT: "stdin"
[2020-10-09T14:57:55Z INFO  fmlrc2_convert]     Output BWT: "./MSBWT/msbwtIn1.npy"
[M::main_ropebwt2] inserted 10415295819 symbols in 741.814 sec, 1908.697 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 836.397 sec, 2281.367 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 956.954 sec, 2720.248 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 995.544 sec, 2960.243 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1109.195 sec, 3242.295 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1158.872 sec, 3482.073 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1124.942 sec, 3531.342 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1208.258 sec, 3681.861 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1181.353 sec, 3648.760 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1142.386 sec, 3613.213 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1267.457 sec, 3865.725 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1275.224 sec, 3816.419 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1310.821 sec, 4043.733 CPU sec
[M::main_ropebwt2] inserted 10415295819 symbols in 1368.313 sec, 4066.785 CPU sec
[M::main_ropebwt2] inserted 891711474 symbols in 186.869 sec, 542.725 CPU sec
[M::main_ropebwt2] constructed FM-index in 20514.184 sec, 47849.493 CPU sec
[M::main_ropebwt2] symbol counts: ($, A, C, G, T, N) = (971561940, 42041238760, 30895481919, 30930291857, 14434249, 41852844215)
[M::main] Version: r187
[M::main] CMD: ropebwt2 -LR
[M::main] Real time: 25067.126 sec; CPU: 50645.214 sec
[2020-10-09T21:55:42Z INFO  fmlrc::bwt_converter] Converted BWT with symbol counts: [971561940, 42041238760, 30895481919, 30930291857, 14434249, 41852844215]
[2020-10-09T21:55:42Z INFO  fmlrc::bwt_converter] RLE-BWT byte length: 15675140363
[2020-10-09T21:58:42Z INFO  fmlrc2_convert] RLE-BWT conversion complete.
(RNAseq) [poultrylab1@pbsnode01 ONT]$ ls -l MSBWT/
total 15307760
-rw-rw-r-- 1 poultrylab1 poultrylab1         171 Oct  9 18:32 merge_npy.py
-rw-rw-r-- 1 poultrylab1 poultrylab1 15675140459 Oct 10 05:58 msbwtIn1.npy
### short reads size:
(RNAseq) [poultrylab1@pbsnode01 2nd_WGS]$ ls -l
total 151681612
-rw-rw-r-- 1 poultrylab1 poultrylab1 77658884018 Oct  9 22:06 NDES00380_clean.fq.gz
-rwxr-xr-x 1 poultrylab1 poultrylab1 37069574339 Oct  3 01:56 NDES00380_L4_1_clean.fq.gz
-rwxr-xr-x 1 poultrylab1 poultrylab1 40589309679 Oct  3 02:03 NDES00380_L4_2_clean.fq.gz

It looks like normal.

holtjma commented 4 years ago

I think I might know what the issue is. Apparently, there is no guarantee that read(...) will go until the EOF, so it may be exiting early and fmlrc2 was not handling that.

I made a branch "bwt_input_checks" to test the minor fix on your data (I've already tested that I didn't break anything locally). Can you check it out via GitHub and see if you're still getting the error?

Let me know if you need any help with checkout/install.

holtjma commented 4 years ago

Just pushed another change that is likely more efficient but should do the same thing, see this commit: https://github.com/HudsonAlpha/rust-fmlrc/commit/b5003a4df423d0731888653ec9b545f2dda6775b

Johnsonzcode commented 4 years ago

I cloned the branch bwt_input_checks and built it, seems work ! But I don't understand. LOL

Johnsonzcode commented 4 years ago

Thank you so much !!!

holtjma commented 4 years ago

Great, I'm glad to hear it fixed the issue! I'm going to roll this into a new release on Monday and you should be able to install the normal way from there. Let me know if you encounter any other issues with fmlrc2.

tuannguyen8390 commented 2 years ago

Hi @holtjma, sorry to open this again but I personally think that you should bring the piece of code you advise in this thread

out toward the README section. I was feeling a bit unclear when reading the instruction as it didn't mention we can merge the paired-end read.

Just a suggestion tho. Thanks for the program, it worked really nice on PromethION & I will use it in our pipeline. Problem solved on my end :).

best,

holtjma commented 2 years ago

Yea, I don't think it will hurt to add that in the short term. I was working on a better way to construct them (see this link), but I haven't had time to work on that in the past few months unfortunately. I'll see if I can add something that cleanly explains it for folks, thanks for the suggestion and glad it's working for you!

tuannguyen8390 commented 2 years ago

Hi @holtjma, thanks for the reply, look great at the moment !

If you don't mind one last point. Maybe not from this thread, but I recalled someone mentioned that they think they need to use only 1 pair of corrected read (say /1) as nanopore is 'single-end' so to speak, I thought the same at the beginning. But it would be great to rectify that you can use the reverse read anyhow in the indexing step.

Tuan

holtjma commented 2 years ago

I'm not sure I understand the context here, can you elaborate or point me to where the discussion occurred?

tuannguyen8390 commented 2 years ago

This is the issue I mentioned previously #7

Initially, I have the same interpretation as the thread opener, as he thinks we only need to use only either forward read or reverse read from paired-end Illumina, not both of them.

Along your response to that is "Not only would you be losing information from your sequencing, but it's possible you may be injecting a strand bias by removing one end of the reads." - so yes, simply load both instead of just forward/reverse read 👍

Hope this clarifies my rubbish English (not a native speaker sorry !)

holtjma commented 2 years ago

Ah I see, I'll have to think about if there's a better way to address more complicated questions. I try to keep the README relatively light for people just landing on the page, but we can think about have a more in-depth FAQ and/or best practices somewhere else in the future (if I ever get time, haha!). Thanks for the suggestions!

p.s. your English is fine, I just have a bad memory for previous issues

tuannguyen8390 commented 2 years ago

FAQ is such a great idea, if (and only if) you ever have time to do one, please also inject the explanation behind the tr -s TN NT, the answer was included in one of the old FMLRC issue https://github.com/holtjma/fmlrc/issues/4.

I personally find it extremely helpful to fully understand the reasoning behind the command,. In addition, FAQ stops people from asking the same question again and again 🧠

Really enjoy the discussion. Thanks!

Stay safe,

Tuan

HudsonAlpha / fmlrc2

run error #1