Open drabe004 opened 4 years ago
Cactus is running successfully on our cluster, and we have now edited/fixed the genome headers, but we're still getting an error (debug log attached) and aren't sure what it is. We've run the evolverMammals example with the exact same commands:
cd /panfs/roc/groups/14/mcgaughs/drabe004/CF_genomes
module load python3
source activate cactus_deps2
export TMPDIR=/scratch.global/drabe004
cactus ./jobstore3 ./fish5_sequencefile.txt ./fish5.hal --realTimeLogging --logFile=debug-log2.txt
debug-log2.txt
Tree file also attached: fish5_sequencefile.txt
This is dying with cactus_blast_chunkSequences throwing the error:
Exception: no chunks produced for files: ['/scratch.global/drabe004/node-0a40cc64-674b-45e5-908d-a5c308075e97-b97d4739-5ed0-4e3c-90e2-b0b4d618d896/tmpalu1_hhb/c8e9b637-47cb-40d2-9204-1e25e6999d70/tmp_rjhd7bj.tmp', '/scratch.global/drabe004/node-0a40cc64-674b-45e5-908d-a5c308075e97-b97d4739-5ed0-4e3c-90e2-b0b4d618d896/tmpalu1_hhb/c8e9b637-47cb-40d2-9204-1e25e6999d70/tmp6rdnkn4z.tmp']
@glennhickey @joelarmstrong any idea what causes this?
Somehow cactus isn't able to read the input fasta files. Further up in the log you have lines like:
Before preprocessing, got assembly stats for genome Astyanax_mexicanus_pachon: Total-sequences: 0 Total-length: 0 Proportion-repeat-masked: -nan ProportionNs: -nan Total-Ns: 0 N50: 0 Median-sequence-length: 0 Max-sequence-length: 0 Min-sequence-length: 0
It is treating all the inputs as empty, which eventually trips the exception. They must either not be valid fasta files, or have some kind of formatting that cactus isn't expecting.
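A quick way to see how a file can come out as "empty" is to count records the way a simple reader might. The sketch below is not Cactus's own stats code; `fasta_stats` is a hypothetical helper that assumes a record only begins at a line starting with `>` and that anything before the first header is ignored. Under that assumption, a file whose `>` characters were stripped reports zero sequences and zero length.

```python
import io


def fasta_stats(handle):
    """Return (record_count, total_bases), starting a record only at a '>' line.

    Hypothetical helper for illustration; not part of Cactus.
    """
    n_records = 0
    total_length = 0
    in_record = False
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
            in_record = True
        elif in_record:
            total_length += len(line)
        # Lines seen before any '>' header are ignored in this sketch.
    return n_records, total_length


# Toy inputs: the second one has had its '>' characters removed.
good = io.StringIO(">scaffold_1\nACGT\nAC\n>scaffold_2\nGG\n")
stripped = io.StringIO("scaffold_1\nACGT\nAC\nscaffold_2\nGG\n")

print(fasta_stats(good))      # (2, 8)
print(fasta_stats(stripped))  # (0, 0)
```

Running something like this on each input before launching a long job would flag a file with no recognizable headers early.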
Hi Glenn and Mark,
Thanks so much for the quick reply! Hmm, OK. In previous iterations we've removed the character with sed -e '/>/s/_/ /', but I see in the example files that > is at the beginning of each assembly name. Can I just confirm the allowed characters in the cactus code for assembly/scaffold names? I suspect this may be the formatting issue.
As a second note, I'm wondering if there is any pre-packaged code that converts Ensembl/NCBI genome files to the required header format for cactus? No problem to make our own; just asking in case someone already has this available.
Thanks again for the help!
Best,
~Danielle
-- Danielle H Drabeck M.Sc.
PhD Student, Department of Ecology, Evolution, and Behavior, University of Minnesota
Drabe004@umn.edu Danielle.Drabeck@gmail.com
“I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.” ― Issac Newton
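Your fasta description lines must begin with >. We should provide an explicit error message right away if that's not the case (#237 https://github.com/ComparativeGenomicsToolkit/cactus/issues/237).
By default, it looks like spaces and tabs (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/preprocessor/checkUniqueHeaders.py#L10-L12) are not allowed, nor are non-alphanumeric characters (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/preprocessor/checkUniqueHeaders.py#L19-L21) other than _ - : .
You should get a reasonable error message for these cases, though.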
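The header rules described here can be sketched as a small pre-flight check. This is a minimal illustration, not the logic in checkUniqueHeaders.py: `check_header` and its `allow_whitespace` parameter are hypothetical, and the allowed set (ASCII letters and digits plus `_`, `-`, `:`, `.`) is an assumption based on the characters listed in this thread.

```python
import re

# Assumed allowed set: ASCII alphanumerics plus '_', '-', ':' and '.'.
ALLOWED_FIRST_WORD = re.compile(r"^[A-Za-z0-9_\-:.]+$")


def check_header(line, allow_whitespace=False):
    """Return None if the header line looks acceptable, else a short reason.

    Hypothetical helper for illustration; not part of Cactus.
    """
    if not line.startswith(">"):
        return "header does not start with '>'"
    body = line[1:].rstrip("\n")
    if not body.strip():
        return "empty header"
    if not allow_whitespace and re.search(r"[ \t]", body):
        return "header contains spaces or tabs"
    first_word = body.split()[0]
    if not ALLOWED_FIRST_WORD.match(first_word):
        return "invalid character in first word: " + first_word
    return None


print(check_header(">NC_030416.1"))                      # None
print(check_header("NC_030416.1"))                       # header does not start with '>'
print(check_header(">NC_030416.1 Ictalurus punctatus"))  # header contains spaces or tabs
print(check_header(">chr1,v2"))                          # invalid character in first word: chr1,v2
```

Looping this over only the lines that are meant to be headers would surface problems in seconds rather than partway through an alignment run.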
Each fasta record should start with '>' followed by a name. This error is different. Could you share one of your fasta files after editing?
Mark
Ah, OK. I think I misconstrued that and removed the > from the files completely. I'll re-edit and re-run, and if I get the same error I'll compress and upload my edited .fasta files to see what may be causing further issues.
Thanks for the quick reply!
Best,
~Danielle
OK, so I have run this again with genomes edited with these simple sed scripts:
sed -e 's/ //g' GCF_001660625.1_IpCoco_1.2_genomic.fna >IPCoco_fixed.fasta
sed -e 's/[ \t]*//' IPCoco_fixed.fasta >IPCoco_fixed2.fasta
The job ran for about 500 hours and then quit (debug log attached). It seems there is still a header issue with the genomes downloaded from NCBI:
The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence
The offending header: NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence
The offending header: NW_015623055.1SinocyclocheilusrhinocerousisolateXijiaounplacedgenomicscaffold,SAMN03320098_v1.1scaffold2266,wholegenomeshotgunsequence
However, the fasta files were clearly readable:
got assembly stats for genome Sinocyclocheilus_rhinocerous: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpwt32cdcb/70cd2790-f3a0-49cd-bace-5a2e0b07a77e/tmpcd4swo_d.tmp Total-sequences: 164173 Total-length: 1655786410 Proportion-repeat-masked: 0.433866 ProportionNs: 0.081114 Total-Ns: 134307852 N50: 945738 Median-sequence-length: 478 Max-sequence-length: 5035720 Min-sequence-length: 200
got assembly stats for genome Typhlichthys_subterraneus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpmc2mu2sa/f2590e05-337e-4647-801f-92c5eeb0a35e/tmp56wbyr9u.tmp Total-sequences: 84841 Total-length: 555559596 Proportion-repeat-masked: 0.304814 ProportionNs: 0.001280 Total-Ns: 711086 N50: 9654 Median-sequence-length: 4320 Max-sequence-length: 103791 Min-sequence-length: 64
got assembly stats for genome Ictalurus_punctatus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmp7j9z7knc/4d197236-4391-45e2-906e-27a03737ec08/tmp4596k045.tmp Total-sequences: 9341 Total-length: 783274721 Proportion-repeat-masked: 0.345783 ProportionNs: 0.014531 Total-Ns: 11381915 N50: 27425808 Median-sequence-length: 1066 Max-sequence-length: 37510255 Min-sequence-length: 500
got assembly stats for genome Astyanax_mexicanus_pachon: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpnf8s_nwc/6abdf26e-2531-4ed9-9322-be599bbd6e50/tmp8o4urbj6.tmp Total-sequences: 10735 Total-length: 1191242572 Proportion-repeat-masked: 0.190564 ProportionNs: 0.190564 Total-Ns: 227007640 N50: 1775308 Median-sequence-length: 3230 Max-sequence-length: 9823298 Min-sequence-length: 876
got assembly stats for genome Astyanax_mexicanus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpequb2qrb/2ed1c4fd-9117-4ee8-8052-9d758906cc28/tmpi6yb34gl.tmp Total-sequences: 2415 Total-length: 1335239194 Proportion-repeat-masked: 0.032685 ProportionNs: 0.032685 Total-Ns: 43642769 N50: 35377769 Median-sequence-length: 47283 Max-sequence-length: 74127438 Min-sequence-length: 1866
I'm attaching the debug log here and will share the genome files via Google Drive. debug-log.txt
Hi all, I just updated my ticket. It seems like there is still a header issue. A link to one of the genomes with issues is here: https://drive.google.com/file/d/1Muu8M0tTTTUD0DjIwCu3Sm5QZ_5u0ZY-/view?usp=sharing
Thanks!
~Danielle
Could you add the URLs for the NCBI fasta files to this ticket?
sure thing!
Sinocyclocheilus_rhinocerous https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/515/625/GCF_001515625.1_SAMN03320098_v1.1/GCF_001515625.1_SAMN03320098_v1.1_genomic.fna.gz
Typhlichthys_subterraneus https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/302/405/GCA_900302405.1_ASM90030240v1/GCA_900302405.1_ASM90030240v1_genomic.fna.gz
Ictalurus_punctatus https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/660/625/GCF_001660625.1_IpCoco_1.2/GCF_001660625.1_IpCoco_1.2_genomic.fna.gz
This is a valid fasta file, although the ids you have generated are going to be painful to use in the long run.
What was the problem with the original sed command I sent?
I see:
RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters.
The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence
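The error makes sense once you look at what deleting every space does to a header: the accession and the free-text description fuse into a single "first word", and that word now contains commas. The sketch below contrasts that with keeping only the first whitespace-delimited token. Both helper names are hypothetical, the allowed-character pattern is taken from the error text above, and the example header is a reconstruction (spaces re-inserted into one of the offending headers), so treat it as an assumption about what the NCBI original looked like.

```python
import re

# Allowed characters as stated in the RuntimeError text above.
ALLOWED = re.compile(r"^[A-Za-z0-9_\-:.]+$")


def strip_all_spaces(header):
    """Mimic the effect of sed 's/ //g' on a single line."""
    return header.replace(" ", "")


def keep_first_word(header):
    """Keep '>' plus the first whitespace-delimited token of a header line."""
    if not header.startswith(">"):
        return header  # sequence line: leave untouched
    parts = header[1:].split()
    return ">" + parts[0] if parts else header


# Reconstructed for illustration; the exact original spacing is an assumption.
header = (">NC_030416.1 Ictalurus punctatus breed USDA103 chromosome 1, "
          "IpCoco_1.2, whole genome shotgun sequence")

glued = strip_all_spaces(header)
trimmed = keep_first_word(header)

print(glued)    # >NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence
print(trimmed)  # >NC_030416.1
print(bool(ALLOWED.match(glued[1:])))    # False: the commas are not allowed
print(bool(ALLOWED.match(trimmed[1:])))  # True
```

Trimming to the accession also gives much shorter IDs. It would still be worth confirming that the trimmed names are unique within each genome before re-running.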
drabe004 notifications@github.com writes:
OK So I have run this again with genomes edited with the simple sed scripts: sed -e 's/ //g' GCF_001660625.1_IpCoco_1.2_genomic.fna >IPCoco_fixed.fasta sed -e 's/[ \t]*//' IPCoco_fixed.fasta >IPCoco_fixed2.fasta
The job ran for about 500hour and then quit-- debug attached. It seems there is still a header issue with genomes downloaded from NCBI: The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence The offending header: NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence The offending header: NW_015623055.1SinocyclocheilusrhinocerousisolateXijiaounplacedgenomicscaffold,SAMN03320098_v1.1scaffold2266,wholegenomeshotgunsequence
However, the fasta was clearly readable: got assembly stats for genome Sinocyclocheilus_rhinocerous: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpwt32cdcb/70cd2790-f3a0-49cd-bace-5a2e0b07a77e/tmpcd4swo_d.tmp Total-sequences: 164173 Total-length: 1655786410 Proportion-repeat-masked: 0.433866 ProportionNs: 0.081114 Total-Ns: 134307852 N50: 945738 Median-sequence-length: 478 Max-sequence-length: 5035720 Min-sequence-length: 200
got assembly stats for genome Typhlichthys_subterraneus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpmc2mu2sa/f2590e05-337e-4647-801f-92c5eeb0a35e/tmp56wbyr9u.tmp Total-sequences: 84841 Total-length: 555559596 Proportion-repeat-masked: 0.304814 ProportionNs: 0.001280 Total-Ns: 711086 N50: 9654 Median-sequence-length: 4320 Max-sequence-length: 103791 Min-sequence-length: 64
got assembly stats for genome Ictalurus_punctatus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmp7j9z7knc/4d197236-4391-45e2-906e-27a03737ec08/tmp4596k045.tmp Total-sequences: 9341 Total-length: 783274721 Proportion-repeat-masked: 0.345783 ProportionNs: 0.014531 Total-Ns: 11381915 N50: 27425808 Median-sequence-length: 1066 Max-sequence-length: 37510255 Min-sequence-length: 500
got assembly stats for genome Astyanax_mexicanus_pachon: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpnf8s_nwc/6abdf26e-2531-4ed9-9322-be599bbd6e50/tmp8o4urbj6.tmp Total-sequences: 10735 Total-length: 1191242572 Proportion-repeat-masked: 0.190564 ProportionNs: 0.190564 Total-Ns: 227007640 N50: 1775308 Median-sequence-length: 3230 Max-sequence-length: 9823298 Min-sequence-length: 876
got assembly stats for genome Astyanax_mexicanus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpequb2qrb/2ed1c4fd-9117-4ee8-8052-9d758906cc28/tmpi6yb34gl.tmp Total-sequences: 2415 Total-length: 1335239194 Proportion-repeat-masked: 0.032685 ProportionNs: 0.032685 Total-Ns: 43642769 N50: 35377769 Median-sequence-length: 47283 Max-sequence-length: 74127438 Min-sequence-length: 1866
I'm attaching the debug here ---I will share the genome files via google drive. debug-log.txt
-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/ComparativeGenomicsToolkit/cactus/issues/226#issuecomment-628767969 OK So I have run this again with genomes edited with the simple sed scripts: sed -e 's/ //g' GCF_001660625.1_IpCoco_1.2_genomic.fna >IPCoco_fixed.fasta sed -e 's/[ \t]*//' IPCoco_fixed.fasta >IPCoco_fixed2.fasta
The job ran for about 500hour and then quit-- debug attached. It seems there is still a header issue with genomes downloaded from NCBI: The offending header: OMKO01000001.1Typhlichthyssubterraneusgenomeassembly,contig:scf7180003279999,wholegenomeshotgunsequence The offending header: NC_030416.1IctaluruspunctatusbreedUSDA103chromosome1,IpCoco_1.2,wholegenomeshotgunsequence The offending header: NW_015623055.1SinocyclocheilusrhinocerousisolateXijiaounplacedgenomicscaffold,SAMN03320098_v1.1scaffold2266,wholegenomeshotgunsequence
However, the fasta was clearly readable:
got assembly stats for genome Sinocyclocheilus_rhinocerous: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpwt32cdcb/70cd2790-f3a0-49cd-bace-5a2e0b07a77e/tmpcd4swo_d.tmp Total-sequences: 164173 Total-length: 1655786410 Proportion-repeat-masked: 0.433866 ProportionNs: 0.081114 Total-Ns: 134307852 N50: 945738 Median-sequence-length: 478 Max-sequence-length: 5035720 Min-sequence-length: 200
got assembly stats for genome Typhlichthys_subterraneus: Input-sample: /scratch.global/drabe004/node-63a14ec9-80a0-4498-b52f-4263391e3833-d71042ea-c1fd-45d2-91f2-62bd2deadcce/tmpmc2mu2sa/f2590e05-337e-4647-801f-92c5eeb0a35e/tmp56wbyr9u.tmp Total-sequences: 84841 Total-length: 555559596 Proportion-repeat-masked: 0.304814 ProportionNs: 0.001280 Total-Ns: 711086 N50: 9654 Median-sequence-length: 4320 Max-sequence-length: 103791 Min-sequence-length: 64
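A small Python sketch of what the two sed expressions quoted above do to an NCBI-style header may help here. The sample header is the description line for OMKO01000001.1 as it appears in the original download (the leading `>` is assumed); the function names are hypothetical, and the sketch only mimics sed, it does not call cactus.

```python
import re


def strip_all_spaces(line):
    # Mimics sed 's/ //g': delete every space on the line.
    return line.replace(" ", "")


def strip_leading_blanks(line):
    # Mimics sed 's/[ \t]*//' (no g flag): only the first match is replaced,
    # and a starred pattern matches at column 0 even when it is empty, so
    # only leading spaces/tabs are ever removed.
    return re.sub(r"^[ \t]*", "", line, count=1)


header = (
    ">OMKO01000001.1 Typhlichthys subterraneus genome assembly, "
    "contig: scf7180003279999, whole genome shotgun sequence"
)

# Apply the two steps in the same order as the commands in the thread.
fused = strip_leading_blanks(strip_all_spaces(header))
print(fused)
```

The printed header has the accession fused to the description, the same shape as the offending headers reported above.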
OH!!! cactus is merging the comment in the FASTA header with the id, creating a huge mess. This is very wrong.
So the original ids in the file are wonderful
OMKO01000001.1 Typhlichthys subterraneus genome assembly, contig: scf7180003279999, whole genome shotgun sequence
The id is OMKO01000001.1, then cactus corrupts it, then it tells you that it is invalid.
If you just run:
sed -e '/>/s/ .*$//' GCA_900302405.1_ASM90030240v1_genomic.fna > GCA_900302405.1_ASM90030240v1_genomic.clean.fa
it will create FASTA files that cactus will not corrupt.
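For anyone who would rather script this than use sed, the same idea can be sketched in Python. The function names here are hypothetical and this is not a cactus utility. It differs slightly from the sed one-liner: it only touches lines that start with `>`, and it cuts at the first whitespace of any kind (space or tab) rather than at the first space.

```python
def clean_fasta_headers(lines):
    """Yield FASTA lines with each header cut down to its ID.

    Header lines (starting with '>') keep only the text before the first
    whitespace; sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            yield ">" + seq_id + "\n"
        else:
            yield line


def clean_fasta_file(src_path, dst_path):
    """Stream src_path to dst_path, cleaning headers along the way."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(clean_fasta_headers(src))
```

Calling `clean_fasta_file("GCA_900302405.1_ASM90030240v1_genomic.fna", "GCA_900302405.1_ASM90030240v1_genomic.clean.fa")` would mirror the sed command above. If two records in one file ever shared the same leading ID, the truncated headers would collide, so a duplicate check may be worth adding before a long run.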
OK awesome! I will give this a shot and re-run!
Running this on a cluster with a virtual env via an interactive session, I'm getting this error, which kills the job:
Traceback (most recent call last):
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/bin/cactus", line 11, in <module>
    load_entry_point('Cactus==1.0', 'console_scripts', 'cactus')()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/cactus/progressive/cactus_progressive.py", line 403, in main
    runCactusProgressive(options)
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/cactus/progressive/cactus_progressive.py", line 451, in runCactusProgressive
    halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options, project, memory=configWrapper.getDefaultMemory()))
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/common.py", line 811, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/common.py", line 1102, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/leader.py", line 223, in run
    self.innerLoop()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/leader.py", line 561, in innerLoop
    self.statsAndLogging.check()
  File "/home/mcgaughs/drabe004/.conda/envs/cactus_deps2/lib/python3.6/site-packages/toil/statsAndLogging.py", line 203, in check
    raise RuntimeError("Stats and logging thread has quit")
RuntimeError: Stats and logging thread has quit