unexpected char in string error

ChristyPeterson commented 4 years ago

Hi Chad,

I'm trying to run panseq on some publicly available genomes, and was successful when running the genomes from a subspecies. As soon as I included two other subspecies, I got an "unexpected char in string" error. Weirdly, this error is coming up in strains that were successful in the first run. Those characters do not exist in the input, so I'm assuming it's in a temp file the program is writing and then referring back to?

Below is an example from the Master log file (the top and bottom).

2019/12/10 14:29:27 INFO |  NovelIterator.pm:186> We have 74 genomes this run 
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
1: PREPARING DATA
Unexpected character `7' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome
Unexpected character `4' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome

Unexpected character `.' in string NZ_CP016054.1_Treponema_pallidum_subsp._pallidum_strain_PT_SIF1127_genome_(1138930..1144388)
2019/12/10 14:30:00 WARN |  CombineFilesIntoSingleFile.pm:83> Skipping /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58_NR as it has size of 0 
2019/12/10 14:30:00 INFO |  Panseq.pm:268> Panseq mode set as pan 
2019/12/10 14:30:00 INFO |  SegmentMaker.pm:164> Segmenting /PATH/vpd/syphilis/panseq/run2-all-strains/6665952b07be10cc3db02af26d6d6f3a_5616179e4a49b14a8e4caa454f9b6f58 into 500bp segments

If I remove the isolate from the analysis I get even more of these errors, for several other isolates. Any insight would be awesome.
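To see at a glance which sequence names the warnings point at, the "Unexpected character" lines in the log can be tallied per name. This is only a rough sketch: the function name `tally_warnings` and the regular expression are assumptions based on the message format shown in the log excerpt above, not anything taken from Panseq itself.

```python
import re
from collections import defaultdict

# Pattern inferred from the log excerpt above (an assumption, not Panseq's own format spec):
#   Unexpected character `7' in string NZ_CP016054.1_..._genome
WARNING = re.compile(r"Unexpected character `(.)' in string (\S+)")

def tally_warnings(log_lines):
    """Group the reported characters by the sequence name they were reported for."""
    flagged = defaultdict(list)
    for line in log_lines:
        match = WARNING.search(line)
        if match:
            char, name = match.groups()
            flagged[name].append(char)
    return dict(flagged)

# Hypothetical usage, with the log path as a placeholder:
# with open("Master.log") as handle:
#     for name, chars in tally_warnings(handle).items():
#         print(name, chars)
```

Comparing the tally between runs (with and without a given isolate) might show whether the same names keep turning up or whether the warnings just move to other genomes.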

Thanks! -Christy

chadlaing commented 4 years ago

Hi Christy,

Is it possible one of the sequences isn't in valid fasta format? It looks similar to errors of that type. If not, could you send me the config file and link the public genomes that cause the error?

Thanks, Chad

ChristyPeterson commented 4 years ago

The file listed as being problematic looks like a valid FASTA file to me. Also, this file went through the first run successfully.

>NZ_CP016054.1 Treponema pallidum subsp. pallidum strain PT_SIF1127 genome
TAGATGGACGCAGTAGGGTATGAAGTATTCTGGAACGAGACACTCAGCCAGATACGGAGTGAATCGACCGAAGCAGAATT
TAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCC
GAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTG
TTTGCCGTTAAAAAAGGCACCCCTCATGGAAATACTGCTCCCCCCAAACACGTGCATACCTACCTGGAGAAAAACTCTCC
TGCAGAGGTTCCTTCCAAAAAGAGCTTTCACCCCGACCTGAACAGAGACTATACCTTCGAGAACTTTGTATCCGGAGAAG
AAACCAAATTCAGCCATAGCGCTGCTATCTCCGTATCAAAAAACCCAGGCACTTCCTACAATCCGTTACTTATCTACGGT
GGAGTGGGACTAGGAAAAACCCACCTTATGCAGGCTATTGGACACGAGATCTACAAGACAACAGACCTGAACGTCATATA
CGTCACTGCGGAGAATTTTGGAAATGAATTCATTTCCACATTACTCAATAAAAAGACCCAGGATTTTAAAAAAAAATACC
GCTACACCGCGGATGTACTTCTTATAGATGACATTCATTTTTTTGAAAACAAAGACGGATTACAAGAAGAGCTTTTCTAT
ACGTTCAACGAACTTTTCGAGAAAAAAAAACAAATTATCTTTACCTGCGACAGGCCTGTACAAGAATTGAAAAATCTCTC
TTCTCGCTTACGCTCGAGGTGCTCCCGAGGGCTTAGCACTGATCTGAATATGCCATGTTTTGAAACGCGCTGTGCTATCT

I did check using grep for any weird characters and nothing pops up outside of the header.
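For anyone wanting to repeat that check in a scriptable way, here is a minimal sketch of the same idea in Python. It assumes nucleotide FASTA and treats anything outside the IUPAC nucleotide codes as "weird"; the function name `scan_fasta` and the exact set of allowed characters are my own choices for illustration, not something Panseq defines.

```python
# IUPAC nucleotide codes; treating only these as valid is an assumption
IUPAC = set("ACGTUNRYSWKMBDHV")

def scan_fasta(lines):
    """Return (line_number, description) pairs for suspicious lines."""
    problems = []
    seen_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            # Windows line endings are easy to miss by eye
            problems.append((number, "carriage return (Windows line ending)"))
            line = line.rstrip("\r")
        if not line:
            continue  # blank lines are skipped rather than reported
        if line.startswith(">"):
            seen_header = True
            continue  # header text is not checked here
        if not seen_header:
            problems.append((number, "sequence data before first header"))
        bad = sorted(set(line.upper()) - IUPAC)
        if bad:
            problems.append((number, "unexpected characters: " + repr(bad)))
    return problems

# Hypothetical usage; newline="" keeps any "\r" so it can be detected:
# with open("genome.fasta", newline="") as handle:
#     for number, description in scan_fasta(handle):
#         print(number, description)
```

An empty result only says the sequence lines look clean. It does not check the header lines, which is where the characters named in the warnings appear.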

I've attached two lists:

acc-list-full.txt is the full list of accessions used for this run.
acc-list-add.txt lists the accessions that were added to run1 (which completed successfully) to make up this run (the full list).

acc-list-add.txt acc-list-full.txt

I looked through all the FASTA files listed in the 'add' txt file, and none of those have any weird characters in the sequence.

For the config file, do you mean the settings file?

ChristyPeterson commented 4 years ago

In case you meant the settings file used to run panseq, I've attached it below, though I've altered the paths to where things are located.

queryDirectory  PATH/ncbi_assemblies/ncbi-genomes-2019-12-06/
baseDirectory   PATH/panseq/run2-all-strains
numberOfCores   20
mummerDirectory /PATH/bin/
blastDirectory  /PATH/bin/
minimumNovelRegionSize  500
novelRegionFinderMode   no_duplicates
muscleExecutable        /PATH/bin/muscle
fragmentationSize       500
percentIdentityCutoff   85
coreGenomeThreshold     2
runMode         pan

chadlaing commented 4 years ago

Perfect, I will take a look.

chadlaing / Panseq

unexpected char in string error #29