bcgsc / NanoSim

Nanopore sequence read simulator

Breaking changes in format output from 3.0.0-beta #132

Closed (eboileau closed this issue 2 years ago)

eboileau commented 2 years ago

Description

The header format in the output of simulator.py has changed. This is a breaking change, e.g. for scripts that extract substrings from the header. It seems to affect the file _aligned_reads.fasta, but not the file _unaligned_reads.fasta.

In genome mode, the change appears to be here:

@@ -1261,8 +1302,8 @@ def simulation_aligned_genome(dna_type, min_l, max_l, median_l, sd_l, out_reads,
-                out_reads.write(id_begin + new_read_name + "_" + str(head) + "_" + str(sum(ref_length_list)) + "_" +
-                                str(tail) + '\n')
+                out_reads.write(id_begin + new_read_name + "_" + str(head) + "_" +
+                                ";".join(str(x) for x in seg_length_list) + "_" + str(tail) + '\n')

Note the ; in place of the previously used _, which results in, for example,

>1_27866045;aligned_1_F_90_5268_35

instead of

>1_27866045_aligned_1_F_90_5268_35

Is there a particular reason for this change, or do you think you could revert to the old format?

To reproduce

Run simulator.py with any version from 3.0.0-beta.

Environment

Python 3.7.6
conda 4.9.2
NanoSim (master, and from 3.0.0-beta)

cheny19 commented 2 years ago

Hi @eboileau, sorry for the late reply. Yes, we changed the header because we introduced chimeric reads in NanoSim v3. A chimeric read has more than one segment, so we use ; to separate the source of each segment.
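
For downstream scripts that still split the header on _, one possible workaround is to map the new header back to the old layout before parsing. The sketch below is not part of NanoSim: to_legacy_header is a hypothetical helper, and it assumes (based only on the example header above) that a non-chimeric read carries at most one ; in its header. Headers with more than one ; are treated as possibly chimeric and rejected rather than guessed at.

```python
def to_legacy_header(header: str) -> str:
    """Rewrite a NanoSim v3 style header into the pre-3.0.0-beta layout.

    Hypothetical helper, not part of NanoSim. Assumes a non-chimeric read
    carries at most one ';' in its header.
    """
    name = header.strip()
    if name.count(";") > 1:
        # Several ';' suggest multiple segments; the old format cannot
        # represent those, so refuse instead of silently mangling fields.
        raise ValueError(f"possibly chimeric read, cannot convert: {name}")
    # Old-style headers contain no ';' and pass through unchanged.
    return name.replace(";", "_")


print(to_legacy_header(">1_27866045;aligned_1_F_90_5268_35"))
# prints >1_27866045_aligned_1_F_90_5268_35
```

Because old-style headers pass through unchanged, the same call can be applied to output from both older and newer NanoSim versions. Chimeric reads would need a parser that understands the multi-segment layout, which this sketch does not attempt.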