CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

NanoSim - KeyError: sequence_id not found in mapping #113

Closed: skrakau closed this issue 1 year ago

skrakau commented 3 years ago

Hi,

Thanks for developing CAMISIM! I am currently trying to simulate data with Illumina and Nanopore reads using the de novo community design. I am using the CAMISIM master branch. With the provided test data (CAMISIM/defaults/genomes/) and the provided mapping files, I got it running using art and nanosim (from the https://github.com/abremges fork).

Then I tried to use the 2nd CAMI Toy Mouse Gut Dataset genomes/, metadata.tsv and genome_to_id.tsv data as a basis to generate new data. For Illumina data this worked smoothly. However, for Nanopore data I get the following errors after simulating the reads and in the final anonymization step:

...
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] 270448.0     22
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] SysCmd: '/home-link/qeakr01/development/NanoSim/src/simulator.py linear -n 22 -r /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa -o /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads/270448.0 -c tools/nanosim_profile/ecoli --seed 2998104995'
2021-07-09 16:17:40 INFO: [GenomePreparation 89018136530] Simulating reads from 270448.0: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa'
2021-07-09 16:31:15 INFO: [GenomePreparation 89018136530] Simulating reads finished
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
...

and

...
2021-07-09 16:44:30 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2021-07-09 16:44:30 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpgY5Fcq
2021-07-09 16:44:30 INFO: [FastaAnonymizer] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads'
2021-07-09 16:44:30 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 2944938622045856594) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpLJDFBS' -s
2021-07-09 16:48:06 INFO: [MetadataReader 1434768039] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/genome_locations.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 31538633047] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/meta_data.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 14979527976] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B'
2021-07-09 16:48:08 ERROR: [Validator 31115876351] sequence_id 'NZ-JH590862.1' not found in mapping

2021-07-09 16:48:08 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 639, in _anonymize_data
    file_path_genome_locations, file_path_metadata, file_path_anonymous_mapping_tmp, stream_output
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 370, in gs_read_mapping
    stream_output, dict_anonymous_to_read_id, dict_sequence_to_genome_id, dict_genome_id_to_tax_id)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 244, in write_gs_read_mapping
    raise KeyError(msg)
KeyError: "sequence_id 'NZ-JH590862.1' not found in mapping\n"

2021-07-09 16:48:08 ERROR: [MetagenomeSimulationPipeline] "sequence_id 'NZ-JH590862.1' not found in mapping\n" in line 117
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM

Do you have any idea what could cause this issue or how I could proceed to fix this?

sim_nanosim.test2.log sim_config.nanosim.test2.ini.txt

AlphaSquad commented 3 years ago

Hi, thank you for your interest in CAMISIM! There are some known problems with NanoSim in the master branch, and I have been working on fixes. You can test them on either the nanosim branch or the python3 branch. Unfortunately, I haven't gotten around to testing either branch extensively, which is why they have not been merged into master. My first suggestion would be to test one (or both) of the branches and see whether the problem persists (or another one comes up).

Also note that CAMISIM was tested with NanoSim version 2.5.0, and the latest NanoSim 3.0 has changed the output formats. I am currently updating the python3 branch to support this version, but it is likely not yet possible to use NanoSim 3.0.

Finally, if you want to stay on the master branch, you could try running the simulation without the anonymization step to see whether the problem is limited to anonymization (if you don't necessarily need the anonymized sequences).

skrakau commented 3 years ago

Hi, thanks a lot for your quick reply! When skipping the anonymization step I still get an error:

...
2021-07-13 11:30:46 INFO: [MetagenomeSimulationPipeline] Creating binning gold standard
2021-07-13 11:30:46 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmpTQnFar/tmpF6ohpM
2021-07-13 11:30:46 INFO: [MetadataReader 10620567601] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/genome_locations.tsv'
2021-07-13 11:30:49 INFO: [MetadataReader 89207668721] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/meta_data.tsv'
2021-07-13 11:31:05 INFO: [MetadataReader 31538633047] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmpTQnFar/read_start_positionsm6IA1n'
2021-07-13 11:31:06 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 122, in run_pipeline
    self._create_binning_gs(list_of_output_gsa)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 523, in _create_binning_gs
    genome_id = dict_sequence_to_genome_id[gen_id]
KeyError: 'NZ'

2021-07-13 11:31:06 ERROR: [MetagenomeSimulationPipeline] 'NZ' in line 122
2021-07-13 11:31:06 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2021-07-13 11:31:06 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmpTQnFar

If I see it right, there are quite a few changes in the master branch (and used for the CAMI II challenge), which were not included in the NanoSim branch. So I am a bit unsure if I should really use this, since I would like to use the same version for both Illumina and Nanopore reads (also for reproducibility reasons).

Does the latest python3 branch commit still work with NanoSim v2.5.0?

AlphaSquad commented 3 years ago

I understand, and that seems reasonable to me. The python3 branch should still support the older version of NanoSim; the goal is for it to support both versions.

skrakau commented 3 years ago

Ok, I tried the python3 branch with type=nanosim3 and NanoSim 2.5.0 (since with type=nanosim the command line arguments did not match v2.5.0), but I get the same error when skipping the anonymization step:

2021-07-15 11:44:57 INFO: [MetagenomeSimulationPipeline] Creating binning gold standard
2021-07-15 11:44:57 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmpvzvogbkb/tmpi4skoh_9
2021-07-15 11:44:57 INFO: [MetadataReader 83241305234] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_py3/internal/genome_locations.tsv'
2021-07-15 11:44:59 INFO: [MetadataReader 22970976076] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_py3/internal/meta_data.tsv'
2021-07-15 11:45:03 INFO: [MetadataReader 46738966671] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmpvzvogbkb/read_start_positions514h5bu3'
2021-07-15 11:45:03 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 122, in run_pipeline
    self._create_binning_gs(list_of_output_gsa)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 523, in _create_binning_gs
    genome_id = dict_sequence_to_genome_id[gen_id]
KeyError: 'NZ'

2021-07-15 11:45:03 ERROR: [MetagenomeSimulationPipeline] 'NZ' in line 122
2021-07-15 11:45:03 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

This time I additionally get the following errors, but I guess they are unrelated to the one above:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 3
[main_samview] truncated file.
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 50
[main_samview] truncated file.
...

Any further ideas by any chance?

AlphaSquad commented 3 years ago

The nanosim3 command line is intended for NanoSim 3.0, so I might have introduced errors for the older NanoSim version; I will investigate. The CIGAR errors come from a bug in NanoSim itself (see also https://github.com/bcgsc/NanoSim/issues/128), which seems to be fixed now, although I still had this error come up occasionally. I already fixed some small errors yesterday, but looking at the name there (NZ vs. NZ-JH590862.1 in your first post), it seems there is still something off with - and _ in sequence names. I am sorry that this is proving to be such a pain. You could help me a lot by retrying with yesterday's changes, running CAMISIM with the debug option, and sending me the log (plus the exact command you started it with) again, so I can take a deeper dive into what is happening there.
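To illustrate how a - versus _ mix-up can turn NZ_JH590862.1 into NZ-JH590862.1 or just NZ, here is a minimal Python sketch. It assumes, purely for illustration, that a simulated read name starts with the sequence ID followed by _-separated fields, and that underscores inside the ID have been replaced by dashes. The helper `resolve_sequence_id`, the mapping, and the read name are hypothetical and are not taken from CAMISIM or NanoSim.

```python
def resolve_sequence_id(read_name, known_sequence_ids):
    """Map the sequence part of a read name back to a known sequence ID.

    Hypothetical helper: assumes the read name has the layout
    '<sequence>_<further fields>' and that underscores inside <sequence>
    may have been turned into dashes.
    """
    sequence_part = read_name.split("_", 1)[0]
    if sequence_part in known_sequence_ids:
        return sequence_part
    # Compare with underscores in the known IDs rewritten as dashes.
    matches = [
        seq_id for seq_id in known_sequence_ids
        if seq_id.replace("_", "-") == sequence_part
    ]
    if len(matches) != 1:
        raise KeyError("sequence_id {!r} not found in mapping".format(sequence_part))
    return matches[0]


sequence_to_genome = {"NZ_JH590862.1": "genome_a"}  # made-up mapping
read_name = "NZ-JH590862.1_1234_aligned_0"          # made-up read name

# Splitting on the wrong separator truncates the ID to 'NZ'.
truncated = read_name.split("-")[0]

resolved = resolve_sequence_id(read_name, sequence_to_genome)
genome_id = sequence_to_genome[resolved]
```

The helper deliberately raises a KeyError when two known IDs differ only in - versus _, since guessing in that case would silently assign reads to the wrong genome.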

skrakau commented 3 years ago

Thanks for the info! I am a bit confused. It seems NanoSim v2.5.0 already has to be called in genome mode with -dna_type linear, but in the current python3 branch only the ReadSimulationNanosim3 class handles this, not the ReadSimulationNanosim class, if I see it right. With type=nanosim I get a corresponding error complaining about this, but I don't have the log file anymore, sorry.

Which NanoSim version would you recommend to use then?

AlphaSquad commented 3 years ago

Ah, oh no, I am also confused: the update for NanoSim 2.5 was only introduced on the nanosim branch, but as you noticed, that branch is relatively old and never got merged into master (and thus not into the python3 branch). nanosim3 as simulator expects NanoSim version 3.0.

skrakau commented 3 years ago

I just ran it with your changes from yesterday in the python3 branch, with NanoSim 2.5.0, anonymous=False, and type=nanosim3, and it finished without the error and I got reads in my output folder :) Great!

But from what you say, I would conclude that I had better not trust these results, since there is some difference between NanoSim 2.5.0 and 3.0, right?

Maybe I could either use NanoSim 1.2.0 (as used in the past, right?) with type=nanosim, or NanoSim 3.0 with type=nanosim3. Which do you think would be more promising?

AlphaSquad commented 3 years ago

Even though I think that if 2.5.0 finished without errors your results probably are usable, I would use the latest NanoSim 3.0 if it works. The model used in 1.2.0 is very old so it probably does not reflect recent chemistry well.

skrakau commented 3 years ago

With NanoSim 3.0.0 it somehow got stuck at the end of the read simulation, i.e. for a small test case it never got any further, and it seemed NanoSim was still running after almost a day. No idea why. I will try to work with 2.5.0 for now then.

skrakau commented 3 years ago

Ok, just in case this is of any value: when I run NanoSim 2.5.0 with anonymous=True I get

...
2021-07-17 11:38:14 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2021-07-17 11:38:14 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpmizrura2
2021-07-17 11:38:14 INFO: [FastaAnonymizer] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/2021.07.17_11.05.12_sample_0/reads'
2021-07-17 11:38:14 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/2021.07.17_11.05.12_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 1156758267001379443) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpadwk3ljp' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmp7vogw2q1' -s
2021-07-17 11:38:35 INFO: [MetadataReader 91442079578] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test/internal/genome_locations.tsv'
2021-07-17 11:38:35 INFO: [MetadataReader 22447665523] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test/internal/meta_data.tsv'
2021-07-17 11:38:35 INFO: [MetadataReader 68413921856] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpadwk3ljp'
2021-07-17 11:38:35 INFO: [FastaAnonymizer 46581866542] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpmizrura2'
2021-07-17 11:38:35 DEBUG: [FastaAnonymizer 46581866542] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpmizrura2' -format 'fasta' -s | shuf -z --random-source=<(get_seeded_random 2869180421932437275) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0C' -format 'fasta' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmp_cnqezse' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmpg_rs8gnp' -s
2021-07-17 11:38:37 INFO: [MetadataReader 39419803297] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test/internal/genome_locations.tsv'
2021-07-17 11:38:38 INFO: [MetadataReader 47215089013] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test/internal/meta_data.tsv'
2021-07-17 11:38:38 INFO: [MetadataReader 39184802387] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/read_start_positions593higdq'
2021-07-17 11:38:38 INFO: [MetadataReader 40650277617] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp_nanosim/tmpi2i7wyhe/tmp_cnqezse'
2021-07-17 11:38:38 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 682, in _anonymize_data
    list_file_paths_read_positions, stream_output
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 343, in gs_contig_mapping
    dict_sequence_name_to_anonymous = self.get_dict_sequence_name_to_anonymous(file_path_id_map)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 161, in get_dict_sequence_name_to_anonymous
    dict_mapping = table.get_map(0, 1)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/MetaDataTable/metadatatable.py", line 612, in get_map
    assert self.has_column(key_column_name), "Column '{}' not found!".format(key_column_name)
AssertionError: Column '0' not found!

2021-07-17 11:38:38 ERROR: [MetagenomeSimulationPipeline] Column '0' not found! in line 117
2021-07-17 11:38:38 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

With anonymous=False it finishes, although with the CIGAR errors. Do you think they impair the resulting reads?

And one other question: would it be possible to create a tag for the current python3 branch? Then I could refer to this particular version when using the data for a manuscript.

AlphaSquad commented 3 years ago

Hm, that is unfortunate about NanoSim 3.0. It didn't happen to me, but I will try to find out what is happening, and likewise for the anonymization step. Do you still have the complete log for that? Sometimes there are small errors, similar to the CIGAR errors, which cause these problems downstream. The CIGAR errors mean that the sam/bam files produced by CAMISIM will have some erroneous lines; the reads themselves shouldn't be affected. The sam files should also be fine with the exception of a few reads. Since there still seem to be some errors, I am somewhat hesitant to give an "official" tag to this branch. I will try to fix them and then merge everything back to master and tag it. If you need a tag sooner rather than later, you could fork the latest commit of the python3 branch and create a tag for this version on your fork.

skrakau commented 3 years ago

OK, thanks for the info again! It ran through successfully for the non-test setup as well now :) I forked it and created a tag/Zenodo ID.

Attached is the log file for the combination of NanoSim 2.5.0 and anonymous=True using the python3 branch, sorry for the delay. sim_nanosim.test.log

skrakau commented 3 years ago

And I just saw that for the Nanopore simulation (anonymous=False) the contigs/gsa.fasta.gz files seem incomplete, i.e. they contain only one or two contig sequences.

Moreover, I specified size=5.0 for both Illumina and Nanopore reads, but while the resulting Illumina files are ~5GB in size, the Nanopore files are ~1GB (I simulated 5 samples).

AlphaSquad commented 3 years ago

Ah, yeah the size is a problem. Since NanoSim requires the number of reads as input and CAMISIM the dataset size, there has to be a conversion from size -> number of reads. But the number of reads needed for a certain size depends on the average read length - which is specific to the trained models. I updated the used model but did not update the average read size. The fact that this happens points towards the fact that the calculation should be automatic depending on the chosen model.

Also, thank you for the log (and the information about the non-anonymous gold standards). I hope to find the problems soon, but I will be on vacation from this Friday until the 16th of August.
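The size to read-count conversion described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not CAMISIM's actual code: the function names and all numbers are made up, and it assumes the target is expressed in bases. It shows why an outdated average read length scales the output by the same factor by which the assumption is off.

```python
def reads_needed(target_bases, mean_read_length):
    """Number of reads whose combined length is roughly target_bases."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return max(1, round(target_bases / mean_read_length))


def mean_length(read_lengths):
    """Average read length, e.g. measured on reads produced by the model."""
    return sum(read_lengths) / len(read_lengths)


target_bases = 1_000_000                      # illustrative target only
assumed_mean = 8000                           # stale value from an older model (made up)
model_mean = mean_length([1500, 2000, 2500])  # made-up lengths from the current model

n_reads_stale = reads_needed(target_bases, assumed_mean)
n_reads_model = reads_needed(target_bases, model_mean)

# Expected number of bases if reads really average model_mean bases.
yield_stale = n_reads_stale * model_mean
yield_model = n_reads_model * model_mean
```

With these made-up values the stale assumption requests too few reads and the expected yield falls to a quarter of the target, whereas deriving the mean from the chosen model recovers it.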

AlphaSquad commented 3 years ago

It seems there is still a bug somewhere in NanoSim or in the length calculation of the reads, so some CIGARs were not correct after all, causing samtools to crash sometimes (there is a lot of nondeterministic behaviour there, which is why it took me a while to figure out). I still have to find out whether my CIGAR calculation or the read lengths/errors reported by NanoSim are at fault. For the time being I have disabled the creation of correct CIGARs in the latest python3 commit, which should resolve the error you reported. Please have a look! Note that every file can still be used and the positions of the reads will still be correct; only the CIGARs in the bam file will be incorrect.
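For anyone who wants to check their own sam files, the samtools message "CIGAR and query sequence are of different length" refers to a consistency rule from the SAM format: the lengths of the CIGAR operations that consume query bases (M, I, S, = and X) must add up to the length of SEQ. A minimal Python sketch of that check follows; the function names are hypothetical and this is not the code CAMISIM or NanoSim uses.

```python
import re

_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_QUERY = frozenset("MIS=X")


def cigar_query_length(cigar):
    """Sum the lengths of all CIGAR operations that consume query bases."""
    tokens = _CIGAR_TOKEN.findall(cigar)
    # Reject empty strings and strings with characters the pattern missed.
    if not tokens or "".join(n + op for n, op in tokens) != cigar:
        raise ValueError("malformed CIGAR: {!r}".format(cigar))
    return sum(int(n) for n, op in tokens if op in _CONSUMES_QUERY)


def cigar_matches_sequence(cigar, sequence):
    """True if the CIGAR is consistent with the length of SEQ."""
    if cigar == "*" or sequence == "*":
        return True  # CIGAR or SEQ unavailable: nothing to compare
    return cigar_query_length(cigar) == len(sequence)
```

For example, `cigar_matches_sequence("3M1D2M", "ACGTA")` holds because the deletion does not consume query bases, while `"3M1I2M"` would require a six-base sequence.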

AlphaSquad commented 1 year ago

This should be fixed in later versions of CAMISIM and NanoSim, though CIGAR calculation is currently still deactivated. The latest NanoSim version, which will be part of the big CAMISIM update, can produce fastq reads directly, so there is no need to convert fasta to fastq within CAMISIM.