bcgsc / NanoSim

Nanopore sequence read simulator

NanoSim produces invalid fastq files #107

Open jkomyno opened 3 years ago

jkomyno commented 3 years ago

Hi, I characterized and then simulated 20000 reads from the E. coli genome. The simulated_aligned_reads.fastq file generated in the simulation phase does not appear to be a valid FASTQ file, according to the validate command of fqtools.

The characterization phase command is:

read_analysis.py genome \
  -i "/data/original/ecoli_R73_2D.fasta" \
  -rg "/data/original/ecoli_K12_MG1655_ref.fa" \
  -o "/data/training" \
  -a minimap2 \
  -t 4

The simulation phase command is:

simulator.py genome \
  -rg "./data/original/ecoli_K12_MG1655_ref.fa" \
  -c "./data/training/training" \
  -o "./data/simulated/simulated" \
  -n 20000 \
  -max 10000 \
  -min 100 \
  -b albacore \
  --seed 42 \
  -dna_type circular \
  --fastq \
  -t 4

fqtools command and validation error:

./fqtools validate ./data/simulated/simulated_aligned_reads.fastq 
ERROR [line 5]: expected header sequence

On the other hand, unaligned reads are ok:

./fqtools validate ./data/simulated/simulated_unaligned_reads.fastq 
OK

cheny19 commented 3 years ago

Hi @jkomyno, the error is reported at line 5, so could you check what line 5 looks like?
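
A quick way to inspect the lines around the reported position is a short script like the one below. It is only a minimal sketch and not part of NanoSim or fqtools; `show_lines` is a hypothetical helper name, and printing lines 1 to 8 (the first two records) is an arbitrary choice. It reports the length of each line, which makes an empty or truncated line easy to spot.

```python
import sys
from itertools import islice


def show_lines(lines, first=1, last=8):
    """Return (line_number, length, text) for lines first..last, 1-indexed.

    `lines` can be any iterable of strings, including an open file handle.
    """
    window = islice(lines, first - 1, last)
    stripped = (line.rstrip("\r\n") for line in window)
    return [(number, len(text), text)
            for number, text in enumerate(stripped, start=first)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python show_lines.py reads.fastq
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, length, text in show_lines(handle):
            # Truncate long sequence lines so the output stays readable.
            print(f"{number}\t{length}\t{text[:60]}")
```

Comparing the lengths printed for lines 2 and 4 shows whether the first record's sequence and quality strings match.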

jkomyno commented 3 years ago

I've added the simulated fastq file here (I'm sorry, I thought I had already linked it in the original issue, but I forgot).

Line 5 is the following:

@ENA|U00096|U00096_2138149;aligned_4_R_13_2748_29

cheny19 commented 3 years ago

It looks like a normal header generated by NanoSim. My guess is that the ; is causing the problem. I quickly checked the fqtools manual, and it seems you can specify which characters are expected in a header. So if ; is not in the default list, the header would be considered invalid. That said, I'm not entirely sure what went wrong. Since I'm busy with my thesis these days, could you try that and let me know how it goes? Thanks!

jkomyno commented 3 years ago

Hi, I ran fqtools -p ';' validate ./data/simulated/simulated_aligned_reads.fastq, but I get the same error.

cheny19 commented 3 years ago

I thought you said there was no error with unaligned reads before?

jkomyno commented 3 years ago

That was a typo, sorry. I edited the comment so it's clearer.

jkomyno commented 3 years ago

Hi @cheny19, any update?

cheny19 commented 3 years ago

Hi @jkomyno, sorry for the lack of updates recently. I don't know much about the validity criteria that fqtools applies. Based on your comment in isONclust, it seems that the tool didn't read the quality scores properly.

@theottlo, do you have any thoughts about this?

theottlo commented 3 years ago

Hi @jkomyno, I apologize for the delay! I was wondering which version of NanoSim you were using to simulate the reads. It looks like the sequence and quality score lengths are different in the aligned fastq file, which is a known bug in NanoSim v2.6.0 and is fixed in the v3.0.0 pre-release.
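
A mismatch of this kind can be checked independently of fqtools with a short script. The following is a minimal sketch that assumes strict four-line FASTQ records (no wrapped sequence or quality lines); `check_fastq` is a hypothetical name, and the script does not reproduce the validation rules of fqtools.

```python
import sys
from itertools import zip_longest


def check_fastq(lines):
    """Return a list of (line_number, problem) for four-line FASTQ records."""
    problems = []
    stripped = (line.rstrip("\r\n") for line in lines)
    # Passing the same iterator four times groups the lines into records.
    records = zip_longest(*[stripped] * 4)
    for index, (header, seq, plus, qual) in enumerate(records):
        line_no = index * 4 + 1  # 1-indexed line number of the header
        if None in (header, seq, plus, qual):
            problems.append((line_no, "truncated record"))
            break
        if not header.startswith("@"):
            problems.append((line_no, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((line_no + 2, "separator does not start with '+'"))
        if len(seq) != len(qual):
            problems.append(
                (line_no + 3,
                 f"sequence length {len(seq)} != quality length {len(qual)}"))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_fastq.py reads.fastq
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line_no, problem in check_fastq(handle):
            print(f"line {line_no}: {problem}")
```

An empty result means every record has a header, a separator, and sequence and quality strings of equal length.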

jkomyno commented 3 years ago

Hi @theottlo, I believe you have access to the fastq file. I cloned the NanoSim repository a few days after v3.0.0 was released.

cheny19 commented 3 years ago

Hi @jkomyno,

Sorry for the late reply. I finally had time to install fqtools. I repeated your simulation command, but with the pre-trained human DNA dataset models as input. Unfortunately, I couldn't reproduce the error: the validate results are OK for both the aligned and the unaligned reads. Could you make sure you are using the latest commit, then try simulating with that pre-trained model and see how it goes?

Cheers, Chen