bcgsc / NanoSim

Nanopore sequence read simulator

Invalid fastq format #181

Closed xinehc closed 1 year ago

xinehc commented 1 year ago

Hi,

Sometimes the fastq output of NanoSim (v3.1) does not contain a valid header line. Here are two examples.

No header line:

image

The header contains 108932256 <0x00> characters and is missing the leading @ character:

image

I generated 25 datasets (with different relative abundances) using the same genomes (all circular) as input. 2 of the 25 were affected, so maybe it is a random problem? NanoSim itself does not report any error or interruption.

Full command:

subprocess.run([
    'src/simulator.py', 
    'metagenome',
    '-a', _abundance,
    '-gl', _location,
    '-dl', _topology,
    '-c', '/pre-trained_models/metagenome_ERR3152366_Log/training',
    '-o', filename,
    '-t', str(ncpus),
    '--fastq', '-b', 'albacore',
    '--seed', '0'], check=True)

Output

2023-01-17 04:22:12: Number of reads simulated >> 1
2023-01-17 04:22:25: Number of reads simulated >> 10001
2023-01-17 04:22:35: Number of reads simulated >> 20001
2023-01-17 04:22:43: Number of reads simulated >> 30001
2023-01-17 04:22:50: Number of reads simulated >> 40001
2023-01-17 04:22:57: Number of reads simulated >> 50001
2023-01-17 04:23:04: Number of reads simulated >> 60001
2023-01-17 04:23:11: Number of reads simulated >> 70001
2023-01-17 04:23:17: Number of reads simulated >> 80001
2023-01-17 04:23:23: Number of reads simulated >> 90001
2023-01-17 04:23:29: Number of reads simulated >> 100001
2023-01-17 04:23:35: Number of reads simulated >> 110001
2023-01-17 04:23:41: Number of reads simulated >> 120001
2023-01-17 04:23:46: Number of reads simulated >> 130001
2023-01-17 04:23:52: Number of reads simulated >> 140001
2023-01-17 04:23:57: Number of reads simulated >> 150001
2023-01-17 04:24:02: Number of reads simulated >> 160001
2023-01-17 04:24:07: Number of reads simulated >> 170001
2023-01-17 04:24:12: Number of reads simulated >> 180001
2023-01-17 04:24:17: Number of reads simulated >> 190001
2023-01-17 04:24:21: Number of reads simulated >> 200001
2023-01-17 04:24:26: Number of reads simulated >> 210001
2023-01-17 04:24:30: Number of reads simulated >> 220001
2023-01-17 04:24:34: Number of reads simulated >> 230001
2023-01-17 04:24:40: Number of reads simulated >> 240001
2023-01-17 04:24:44: Number of reads simulated >> 250001
2023-01-17 04:24:48: Number of reads simulated >> 260001
2023-01-17 04:24:52: Number of reads simulated >> 270001
2023-01-17 04:24:57: Number of reads simulated >> 280001
2023-01-17 04:25:01: Number of reads simulated >> 290001
2023-01-17 04:25:05: Number of reads simulated >> 300001
2023-01-17 04:25:08: Number of reads simulated >> 310001
2023-01-17 04:25:11: Number of reads simulated >> 320001
2023-01-17 04:25:15: Number of reads simulated >> 330001
2023-01-17 04:25:18: Number of reads simulated >> 340001

2023-01-17 04:25:21: Number of reads simulated >> 350001

2023-01-17 04:25:24: Number of reads simulated >> 360001

2023-01-17 04:25:34: Number of reads simulated >> 370001

2023-01-17 04:26:52: Start simulation of random reads
2023-01-17 04:27:01: Number of reads simulated >> 380001
2023-01-17 04:27:09: Number of reads simulated >> 390001
2023-01-17 04:27:19: Finished!

Traceback (most recent call last):
  File "sim.py", line 114, in <module>
    simulate_nanopore(prefix, k, metadata)
  File "sim.py", line 98, in simulate_nanopore
    cnt[accession] += int(line.split('_')[-2])
IndexError: list index out of range

kmnip commented 1 year ago

It looks like this is a duplicate of #168, which should have been fixed in the master branch. Can you please confirm?

We still need to make a new release that incorporates the bugfix.

xinehc commented 1 year ago

I am using fb967aac2ab733067f955414aaad7a59b144c58a.

This one is weird: I just tried to reproduce the results using the same input files and seed, but the problem is gone for unknown reasons. Maybe the problem is memory-related? The 108932256 <0x00> seems very suspicious.

I will install the latest version and rerun the whole simulation later to see if I could find a minimal reproducible example.
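
One way to narrow this down is to scan each generated FASTQ file for the two symptoms described above (a missing `@` header and `<0x00>` bytes) before any downstream parsing. The following is only a sketch and not part of NanoSim: the function name `find_bad_fastq_records` and its `limit` parameter are hypothetical, and it assumes strict four-line FASTQ records with no wrapped sequence lines.

```python
def find_bad_fastq_records(path, limit=10):
    """Return up to `limit` (line_number, reason) tuples for malformed records.

    Hypothetical helper; assumes every record is exactly four lines.
    """
    problems = []
    line_no = 0
    # Read as bytes so NUL bytes and odd encodings cannot break decoding.
    with open(path, "rb") as handle:
        while len(problems) < limit:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                break  # clean end of file
            header, seq, plus, qual = (r.rstrip(b"\r\n") for r in record)
            start = line_no + 1  # 1-based line number of this record's first line
            line_no += 4
            if b"\x00" in b"".join(record):
                problems.append((start, "NUL bytes in record"))
            elif not header.startswith(b"@"):
                problems.append((start, "header does not start with '@'"))
            elif not plus.startswith(b"+"):
                problems.append((start, "separator line does not start with '+'"))
            elif len(seq) != len(qual):
                problems.append((start, "sequence and quality lengths differ"))
    return problems
```

Because a missing header line shifts every following record by one line, only the first reported problem is a reliable pointer to where the file went wrong; later entries may just be a consequence of the shift.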

kmnip commented 1 year ago

It looks like this is not the same as #168, which does have an empty header line with a leading '>' character. Your case actually looks slightly different: there was no header line at all in your first screenshot.

I suspect that more than one output was being written to the same file, which corrupted it.
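
If that suspicion is right (it was not confirmed in this thread), one way to rule it out when launching many simulations from a script is to make each run claim its output prefix before starting. This is a minimal sketch, not something NanoSim provides: `claim_output_prefix` and the `.lock` marker file are hypothetical names, and the returned string is meant to be passed as the `-o` value.

```python
from pathlib import Path


def claim_output_prefix(out_dir, name):
    """Reserve an output prefix so that two runs cannot share it.

    Hypothetical helper: creates `<out_dir>/<name>.lock` exclusively and
    returns `<out_dir>/<name>` as a string.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    marker = out_dir / f"{name}.lock"
    # Mode "x" creates the file and raises FileExistsError if it already
    # exists, so a second run using the same prefix fails immediately.
    with open(marker, "x"):
        pass
    return str(out_dir / name)
```

A second call with the same `out_dir` and `name` raises `FileExistsError` instead of silently letting two simulations write to the same files.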

xinehc commented 1 year ago

I reran the simulation yesterday using the latest version, and no error was detected. It seems the problem has either been solved or is related to my system configuration. Thanks!