ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Official release issue running "Stats and logging thread has quit" #226

Open drabe004 opened 4 years ago

drabe004 commented 4 years ago

Running this on a cluster with a virtual env via an interactive session:

Getting this error which kills the job:

Traceback (most recent call last):
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/bin/cactus", line 11, in <module>
    load_entry_point('Cactus==1.0', 'console_scripts', 'cactus')()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/cactus/progressive/cactus_progressive.py", line 403, in main
    runCactusProgressive(options)
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/cactus/progressive/cactus_progressive.py", line 451, in runCactusProgressive
    halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options, project, memory=configWrapper.getDefaultMemory()))
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/common.py", line 811, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/common.py", line 1102, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/leader.py", line 223, in run
    self.innerLoop()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/leader.py", line 561, in innerLoop
    self.statsAndLogging.check()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/statsAndLogging.py", line 203, in check
    raise RuntimeError("Stats and logging thread has quit")
RuntimeError: Stats and logging thread has quit

drabe004 commented 4 years ago

Running this successfully on a cluster and have now edited/fixed genome headers. We're still getting an error (debug log attached). Not sure what this is. We've run the evolverMammals example with the exact same commands:

cd /panfs/roc/groups/14/mcgaughs/drabe004/CF_genomes
module load python3
source activate cactus_deps2

export TMPDIR=/scratch.global/drabe004
cactus ./jobstore3 ./fish5_sequencefile.txt ./fish5.hal --realTimeLogging --logFile=debug-log2.txt

debug-log2.txt

Tree file also attached: fish5_sequencefile.txt

diekhans commented 4 years ago

This is dying with cactus_blast_chunkSequences throwing the error:

Exception: no chunks produced for files: ['/scratch.global/drabe004/node-0a40cc64-674b-45e5-908d-a5c308075e97-b97d4739-5ed0-4e3c-90e2-b0b4d618d896/tmpalu1_hhb/c8e9b637-47cb-40d2-9204-1e25e6999d70/tmp_rjhd7bj.tmp', '/scratch.global/drabe004/node-0a40cc64-674b-45e5-908d-a5c308075e97-b97d4739-5ed0-4e3c-90e2-b0b4d618d896/tmpalu1_hhb/c8e9b637-47cb-40d2-9204-1e25e6999d70/tmp6rdnkn4z.tmp']

@glennhickey @joelarmstrong any idea what causes this?

glennhickey commented 4 years ago

Somehow cactus isn't able to read the input fasta files. Further up in the log you have lines like

Before preprocessing, got assembly stats for genome Astyanax_mexicanus_pachon: Total-sequences: 0 Total-length: 0 Proportion-repeat-masked: -nan ProportionNs: -nan Total-Ns: 0 N50: 0 Median-sequence-length: 0 Max-sequence-length: 0 Min-sequence-length: 0

It is treating all the inputs as empty, which eventually trips the exception. They must either not be valid fasta files, or have some kind of formatting that cactus isn't expecting.
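A quick way to check whether a file would be seen as empty is to count its records before launching a long run. The following is a minimal sketch and not part of cactus: `fasta_stats` is a hypothetical helper, and it assumes an uncompressed, plain-text fasta in which a record is only recognized once a line starting with `>` has been seen.

```python
import io


def fasta_stats(handle):
    """Count records and total sequence length in a fasta stream.

    Hypothetical helper for illustration only; not part of cactus.
    """
    n_records = 0
    total_length = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line opens a new record
            n_records += 1
        elif n_records > 0:
            # Sequence lines only count once a header has been seen
            total_length += len(line)
    return n_records, total_length


# With the '>' present, one record of length 6 is found
good = io.StringIO(">chr1\nACGT\nAC\n")
# With the '>' stripped, no record is recognized at all
bad = io.StringIO("chr1\nACGT\nAC\n")
print(fasta_stats(good))
print(fasta_stats(bad))
```

Under that assumption, a file whose `>` markers were removed reports zero sequences and zero length, which is the same symptom as the stats line quoted above.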

drabe004 commented 4 years ago

Hi Glenn and Mark,

Thanks so much for the quick reply! Hmm, OK, in previous iterations we've removed the character with sed -e '/>/s/_/ /' -- but I see in the example files that > is at the beginning of each assembly name. Can I just confirm the allowed characters in the cactus code for assembly/scaffold names? I suspect this may be the issue with formatting.

As a second note -- is there any pre-packaged code that converts Ensembl/NCBI genome files to the required header format for cactus? No problem to make our own, just asking in case someone already has this available.

Thanks again for the help!

Best,

~Danielle


-- Danielle H Drabeck M.Sc.

PhD Student, Department of Ecology, Evolution, and Behavior, University of Minnesota

Drabe004@umn.edu Danielle.Drabeck@gmail.com


“I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.” ― Isaac Newton

glennhickey commented 4 years ago

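Your fasta description lines must begin with >. We should provide an explicit error message right away if that's not the case (#237 https://github.com/ComparativeGenomicsToolkit/cactus/issues/237).

By default, it looks like spaces and tabs (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/preprocessor/checkUniqueHeaders.py#L10-L12) are not allowed, nor are non-alphanumeric characters (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/preprocessor/checkUniqueHeaders.py#L19-L21) other than _ - : . You should get a reasonable error message for these cases, though.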
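The header rules in this comment can be checked ahead of time with a few lines of Python. This is a rough sketch of those rules as stated in the thread, not the code in checkUniqueHeaders.py: `header_problems` and `INVALID_CHAR` are hypothetical names, and the allowed set (alphanumerics plus `_`, `-`, `:` and `.`) is an assumption taken from the comment.

```python
import re

# Assumption: allowed characters are alphanumerics plus '_', '-', ':' and '.'
INVALID_CHAR = re.compile(r"[^A-Za-z0-9_\-:.]")


def header_problems(header_line):
    """List the problems found in one fasta header line (hypothetical helper)."""
    problems = []
    line = header_line.rstrip("\n")
    if not line.startswith(">"):
        problems.append("missing leading '>'")
        name = line
    else:
        name = line[1:]
    if re.search(r"[ \t]", name):
        problems.append("contains a space or tab")
    # Only the first whitespace-delimited word is checked for other characters
    words = name.split()
    first_word = words[0] if words else ""
    bad = sorted(set(INVALID_CHAR.findall(first_word)))
    if bad:
        problems.append("invalid characters: " + "".join(bad))
    return problems


print(header_problems(">NC_030416.1"))
print(header_problems("NC_030416.1 Ictalurus punctatus"))
print(header_problems(">scaffold,1"))
```

Running a check like this over only the header lines of each input (for example, the lines selected by `grep '>'`) is cheap compared with waiting for an alignment job to fail.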

diekhans commented 4 years ago

Each fasta record should start with the '>' followed by a name. This error is different. Could you share one of your fasta files after editing?

Mark


drabe004 commented 4 years ago

Ah, OK, I think I misconstrued that and removed the > from the files completely. I'll re-edit and rerun, and if I get the same error I'll compress and upload my edited .fasta files to see what may be causing further issues.

Thanks for the quick reply!

Best,

~Danielle


drabe004 commented 4 years ago

OK, so I have run this again with genomes edited with these simple sed scripts:

sed -e 's/ //g' GCF_001660625.1_IpCoco_1.2_genomic.fna >IPCoco_fixed.fasta
sed -e 's/[ \t]*//' IPCoco_fixed.fasta >IPCoco_fixed2.fasta

The job ran for about 500 hours and then quit (debug log attached). It seems there is still a header issue with genomes downloaded from NCBI:

The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence
The offending header: NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence
The offending header: NW_015623055.1SinocyclocheilusrhinocerousisolateXijiaounplacedgenomicscaffold,SAMN03320098_v1.1scaffold2266,wholegenomeshotgunsequence

However, the fasta was clearly readable:

got assembly stats for genome Sinocyclocheilus_rhinocerous: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpwt32cdcb/70cd2790-f3a0-49cd-bace-5a2e0b07a77e/tmpcd4swo_d.tmp Total-sequences: 164173 Total-length: 1655786410 Proportion-repeat-masked: 0.433866 ProportionNs: 0.081114 Total-Ns: 134307852 N50: 945738 Median-sequence-length: 478 Max-sequence-length: 5035720 Min-sequence-length: 200

got assembly stats for genome Typhlichthys_subterraneus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpmc2mu2sa/f2590e05-337e-4647-801f-92c5eeb0a35e/tmp56wbyr9u.tmp Total-sequences: 84841 Total-length: 555559596 Proportion-repeat-masked: 0.304814 ProportionNs: 0.001280 Total-Ns: 711086 N50: 9654 Median-sequence-length: 4320 Max-sequence-length: 103791 Min-sequence-length: 64

got assembly stats for genome Ictalurus_punctatus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmp7j9z7knc/4d197236-4391-45e2-906e-27a03737ec08/tmp4596k045.tmp Total-sequences: 9341 Total-length: 783274721 Proportion-repeat-masked: 0.345783 ProportionNs: 0.014531 Total-Ns: 11381915 N50: 27425808 Median-sequence-length: 1066 Max-sequence-length: 37510255 Min-sequence-length: 500

got assembly stats for genome Astyanax_mexicanus_pachon: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpnf8s_nwc/6abdf26e-2531-4ed9-9322-be599bbd6e50/tmp8o4urbj6.tmp Total-sequences: 10735 Total-length: 1191242572 Proportion-repeat-masked: 0.190564 ProportionNs: 0.190564 Total-Ns: 227007640 N50: 1775308 Median-sequence-length: 3230 Max-sequence-length: 9823298 Min-sequence-length: 876

got assembly stats for genome Astyanax_mexicanus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpequb2qrb/2ed1c4fd-9117-4ee8-8052-9d758906cc28/tmpi6yb34gl.tmp Total-sequences: 2415 Total-length: 1335239194 Proportion-repeat-masked: 0.032685 ProportionNs: 0.032685 Total-Ns: 43642769 N50: 35377769 Median-sequence-length: 47283 Max-sequence-length: 74127438 Min-sequence-length: 1866

I'm attaching the debug log here. I will share the genome files via Google Drive.

debug-log.txt

drabe004 commented 4 years ago

Hi all, I just updated my ticket. It seems there is still a header issue. A link to one of the genomes with issues is here: https://drive.google.com/file/d/1Muu8M0tTTTUD0DjIwCu3Sm5QZ_5u0ZY-/view?usp=sharing

Thanks!

~Danielle


diekhans commented 4 years ago

Could you add the URLs for the NCBI fasta files to this ticket?


drabe004 commented 4 years ago

sure thing!

Sinocyclocheilus_rhinocerous https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/515/625/GCF_001515625.1_SAMN03320098_v1.1/GCF_001515625.1_SAMN03320098_v1.1_genomic.fna.gz

Typhlichthys_subterraneus https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/302/405/GCA_900302405.1_ASM90030240v1/GCA_900302405.1_ASM90030240v1_genomic.fna.gz

Ictalurus_punctatus https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/660/625/GCF_001660625.1_IpCoco_1.2/GCF_001660625.1_IpCoco_1.2_genomic.fna.gz

drabe004 commented 4 years ago

This is a valid fasta file, although the ids you have generated are going to be painful to use in the long run.

What was the problem with the original sed command I sent?


diekhans commented 4 years ago

I see:

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence
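The error above follows from deleting every space: the whole description line collapses into a single "first word", which then contains commas. One possible alternative, assuming the accession before the first space is a sufficient identifier, is to keep only that first word of the original header. The sketch below is illustrative and not a cactus utility; `keep_first_word` is a hypothetical name, and the example header is reconstructed from the offending header by guessing where the original spaces were.

```python
import io


def keep_first_word(in_handle, out_handle):
    """Copy a fasta stream, truncating each header to its first word.

    Hypothetical helper for illustration only; not part of cactus.
    """
    for line in in_handle:
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            out_handle.write(">" + name + "\n")
        else:
            # Sequence lines are copied through unchanged
            out_handle.write(line)


# Example header with the spaces guessed back in (an assumption)
src = io.StringIO(
    ">NC_030416.1 Ictalurus punctatus breed USDA103 chromosome 1, "
    "IpCoco_1.2, whole genome shotgun sequence\n"
    "ACGT\n"
)
dst = io.StringIO()
keep_first_word(src, dst)
print(dst.getvalue())
```

For real genome files the same function can be given open file handles in place of the `io.StringIO` objects. After truncating, it would be worth confirming that the shortened names are still unique within each file.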

drabe004 notifications@github.com writes:

OK So I have run this again with genomes edited with the simple sed scripts: sed -e 's/ //g' GCF_001660625.1_IpCoco_1.2_genomic.fna >IPCoco_fixed.fasta sed -e 's/[ \t]*//' IPCoco_fixed.fasta >IPCoco_fixed2.fasta

The job ran for about 500hour and then quit-- debug attached. It seems there is still a header issue with genomes downloaded from NCBI: The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence The offending header: NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence The offending header: NW_015623055.1SinocyclocheilusrhinocerousisolateXijiaounplacedgenomicscaffold,SAMN03320098_v1.1scaffold2266,wholegenomeshotgunsequence

However, the fasta was clearly readable:

got assembly stats for genome Sinocyclocheilus_rhinocerous:
Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpwt32cdcb/70cd2790-f3a0-49cd-bace-5a2e0b07a77e/tmpcd4swo_d.tmp
Total-sequences: 164173
Total-length: 1655786410
Proportion-repeat-masked: 0.433866
ProportionNs: 0.081114
Total-Ns: 134307852
N50: 945738
Median-sequence-length: 478
Max-sequence-length: 5035720
Min-sequence-length: 200

got assembly stats for genome Typhlichthys_subterraneus:
Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpmc2mu2sa/f2590e05-337e-4647-801f-92c5eeb0a35e/tmp56wbyr9u.tmp
Total-sequences: 84841
Total-length: 555559596
Proportion-repeat-masked: 0.304814
ProportionNs: 0.001280
Total-Ns: 711086
N50: 9654
Median-sequence-length: 4320
Max-sequence-length: 103791
Min-sequence-length: 64

got assembly stats for genome Ictalurus_punctatus:
Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmp7j9z7knc/4d197236-4391-45e2-906e-27a03737ec08/tmp4596k045.tmp
Total-sequences: 9341
Total-length: 783274721
Proportion-repeat-masked: 0.345783
ProportionNs: 0.014531
Total-Ns: 11381915
N50: 27425808
Median-sequence-length: 1066
Max-sequence-length: 37510255
Min-sequence-length: 500

got assembly stats for genome Astyanax_mexicanus_pachon:
Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpnf8s_nwc/6abdf26e-2531-4ed9-9322-be599bbd6e50/tmp8o4urbj6.tmp
Total-sequences: 10735
Total-length: 1191242572
Proportion-repeat-masked: 0.190564
ProportionNs: 0.190564
Total-Ns: 227007640
N50: 1775308
Median-sequence-length: 3230
Max-sequence-length: 9823298
Min-sequence-length: 876

got assembly stats for genome Astyanax_mexicanus:
Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpequb2qrb/2ed1c4fd-9117-4ee8-8052-9d758906cc28/tmpi6yb34gl.tmp
Total-sequences: 2415
Total-length: 1335239194
Proportion-repeat-masked: 0.032685
ProportionNs: 0.032685
Total-Ns: 43642769
N50: 35377769
Median-sequence-length: 47283
Max-sequence-length: 74127438
Min-sequence-length: 1866

I'm attaching the debug log here, and I will share the genome files via Google Drive. debug-log.txt


diekhans commented 4 years ago

OH!!! cactus is merging the comment in the FASTA header with the id, creating a huge mess. This is very wrong.

So the original ids in the file are wonderful

OMKO01000001.1 Typhlichthys subterraneus genome assembly, contig: scf7180003279999, whole genome shotgun sequence

The id is OMKO01000001.1, then cactus corrupts it, then it tells you that it is invalid.

If you just do:

sed -e '/>/s/ .*$//' GCA_900302405.1_ASM90030240v1_genomic.fna > GCA_900302405.1_ASM90030240v1_genomic.clean.fa

it will create fasta files that cactus will not corrupt.
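For anyone who would rather do the same cleanup in Python, here is a minimal sketch of the same idea: keep only the text before the first space on each header line. The function names are hypothetical. Unlike the sed pattern `/>/`, which matches any line containing `>`, this only touches lines that start with `>`.

```python
def clean_fasta_lines(lines):
    """Yield FASTA lines with each header cut at its first space.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            # Preserve the trailing newline, if the line has one.
            newline = "\n" if line.endswith("\n") else ""
            yield line.rstrip("\n").split(" ", 1)[0] + newline
        else:
            yield line


def clean_fasta(in_path, out_path):
    """Stream in_path to out_path, truncating headers along the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(clean_fasta_lines(src))
```

Calling `clean_fasta("GCA_900302405.1_ASM90030240v1_genomic.fna", "GCA_900302405.1_ASM90030240v1_genomic.clean.fa")` would be the counterpart of the sed command above. It reads line by line, so a large genome is not loaded into memory at once.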

drabe004 commented 4 years ago

OK awesome! I will give this a shot and re-run!