chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

A standard exception occurred in thread #45

Closed asdcid closed 4 years ago

asdcid commented 5 years ago

Hi,

I tried to run Shasta with:

#!/bin/bash
set -e
SCRIPT='/home/raymond/devel/shasta/shasta-Linux-0.1.0'
inputFile='canu_corr150x.correctedReads.nonOtherChar.fasta'
outputFile='correct_1kb'

$SCRIPT \
    --input $inputFile \
    --output $outputFile

But it raised this error:

Shasta Release 0.1.0
2019-Aug-15 00:52:11.143336
This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.

Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.

For more information about the Shasta assembler, see
https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here:
https://chanzuckerberg.github.io/shasta

Options in use:
Input FASTA files: canu_corr150x.correctedReads.nonOtherChar.fasta
outputDirectory = correct_1kb
memoryMode = anonymous
memoryBacking = 4K

[Reads]
minReadLength = 10000
palindromicReads.maxSkip = 100
palindromicReads.maxMarkerFrequency = 10
palindromicReads.alignedFractionThreshold = 0.1
palindromicReads.nearDiagonalFractionThreshold = 0.1
palindromicReads.deltaThreshold = 100

[Kmers]
k = 10
probability = 0.1

[MinHash]
m = 4
hashFraction = 0.01
minHashIterationCount = 10
maxBucketSize = 10
minFrequency = 2

[Align]
maxSkip = 30
maxMarkerFrequency = 10
minAlignedMarkerCount = 100
maxTrim = 30

[ReadGraph]
maxAlignmentCount = 6
minComponentSize = 100
maxChimericReadDistance = 2

[MarkerGraph]
minCoverage = 10
maxCoverage = 100
lowCoverageThreshold = 0
highCoverageThreshold = 256
maxDistance = 30
edgeMarkerSkipThreshold = 100
pruneIterationCount = 6
simplifyMaxLength = 10,100,1000

[Assembly]
markerGraphEdgeLengthThresholdForConsensus = 1000
consensusCaller = SimpleConsensusCaller
useMarginPhase = False
storeCoverageData = False

Shasta Release 0.1.0
2019-Aug-15 00:52:11.143998 Loading reads from /data/raymond/work/Eucalyptus_pauciflora/genome/bin/genome/assembly/shasta/canu_corr150x.correctedReads.nonOtherChar.fasta.
Input file block size: 2147483648 bytes.
Using 1 threads for reading and 56 threads for processing.
Input file size is 74766867552 bytes.
2019-Aug-15 00:52:11.156517 Reading block 0 2147483648, 2147483648 bytes.
Block read in 2.12567 s at 1.01026e+09 bytes/s.
Processing 2147471993 input characters.
A standard exception occurred in thread 42: Invalid base character 78
./run.sh: line 12: 42463 Aborted                 (core dumped) $SCRIPT --input $inputFile --output $outputFile

Do you know what caused this issue?

Many thanks, Raymond

paoloczi commented 5 years ago

The message says Invalid base character 78, and 78 is N in ASCII. This means that your reads contain no-called bases (N), which Shasta does not support. You will have to filter your fasta files to remove any reads that contain no-called bases.
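A minimal sketch of the filtering step described above. It is not part of Shasta. It assumes one header line and one sequence line per read, and it assumes only uppercase A, C, G, T count as valid bases. The function name `filter_no_called_reads` is hypothetical.

```python
import io

# Assumption: only uppercase A, C, G, T are treated as valid bases.
VALID_BASES = frozenset("ACGT")


def filter_no_called_reads(lines):
    """Yield (header, sequence) for reads made only of valid bases.

    Assumes each read is one '>' header line followed by one sequence line.
    """
    header = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            header = line
        elif header is not None:
            # Keep the read only if every character is a valid base.
            if line and set(line) <= VALID_BASES:
                yield header, line
            header = None


# The number in the error message is an ASCII code.
print(chr(78))

# Made-up example data: read2 contains a no-called base and is dropped.
example = io.StringIO(">read1\nACGT\n>read2\nACNT\n>read3\nGGCC\n")
for header, sequence in filter_no_called_reads(example):
    print(header, sequence)
```

Dropping the whole read, rather than deleting the N characters, matches the advice in the comment above.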

paoloczi commented 5 years ago

Two possible courses of action for Shasta development:

  1. Improve that error message.
  2. Ignore on input reads that contain no-called bases, reporting how many reads and bases were discarded in this way.

Because of this, I will leave this issue open and mark it as enhancement.

asdcid commented 5 years ago

Thanks, paoloczi

biowackysci commented 5 years ago

Hi, I think I have a similar error. I ran shasta with the following on a Slurm engine:

_assembly]$ shasta --input /group/pasture/Saila/Shasta_assembly/shasta_assembly_reads.fasta

The output was:

Shasta Release 0.1.0
2019-Aug-15 10:17:36.951600
This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.

Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.

For more information about the Shasta assembler, see https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here: https://chanzuckerberg.github.io/shasta

2019-Aug-15 10:17:36.954764 Terminated after catching a runtime error exception: Output directory ShastaRun already exists. Remove it or use --output to specify a different output directory.
[vici86y@dev1 Shasta_assembly]$ shasta --input /group/pasture/Saila/Shasta_assembly/shasta_assembly_reads.fasta --output ShastaRun2
Shasta Release 0.1.0
2019-Aug-15 10:18:18.287359
This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.

Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.

For more information about the Shasta assembler, see https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here: https://chanzuckerberg.github.io/shasta

Options in use:
Input FASTA files: /group/pasture/Saila/Shasta_assembly/shasta_assembly_reads.fasta
outputDirectory = ShastaRun2
memoryMode = anonymous
memoryBacking = 4K

[Reads]
minReadLength = 10000
palindromicReads.maxSkip = 100
palindromicReads.maxMarkerFrequency = 10
palindromicReads.alignedFractionThreshold = 0.1
palindromicReads.nearDiagonalFractionThreshold = 0.1
palindromicReads.deltaThreshold = 100

[Kmers]
k = 10
probability = 0.1

[MinHash]
m = 4
hashFraction = 0.01
minHashIterationCount = 10
maxBucketSize = 10
minFrequency = 2

[Align]
maxSkip = 30
maxMarkerFrequency = 10
minAlignedMarkerCount = 100
maxTrim = 30

[ReadGraph]
maxAlignmentCount = 6
minComponentSize = 100
maxChimericReadDistance = 2

[MarkerGraph]
minCoverage = 10
maxCoverage = 100
lowCoverageThreshold = 0
highCoverageThreshold = 256
maxDistance = 30
edgeMarkerSkipThreshold = 100
pruneIterationCount = 6
simplifyMaxLength = 10,100,1000

[Assembly]
markerGraphEdgeLengthThresholdForConsensus = 1000
consensusCaller = SimpleConsensusCaller
useMarginPhase = False
storeCoverageData = False

Shasta Release 0.1.0
2019-Aug-15 10:18:18.412142 Loading reads from /group/pasture/Saila/Shasta_assembly/shasta_assembly_reads.fasta.
Input file block size: 2147483648 bytes.
Using 1 threads for reading and 48 threads for processing.
Input file size is 231040715472 bytes.
2019-Aug-15 10:18:19.706959 Reading block 0 2147483648, 2147483648 bytes.
Block read in 5.60592 s at 3.83074e+08 bytes/s.
Processing 2147473237 input characters.
A standard exception occurred in thread 19: Invalid base character 13
Aborted (core dumped)

I am not sure if it's the same issue of not recognising the fasta file, as I used a Linux command to convert my fastq files into fasta files. The Perl script that was on the website could not be cloned onto our server.

Is there a way I can overcome this?

Thanks S

paoloczi commented 5 years ago

This is a different problem. The Shasta assembler has a documented limitation that each of the input reads must be on a single line of the fasta file. There is an open issue on this, issue #28.

When converting your fastq to fasta, make sure to create a single line for each read. Don't write a line end every 50 or 80 bases or at all - just a line end at the end of each read.

The Shasta script to convert fastq to fasta is a Python script, not a Perl script. It should run on any Linux system with Python 3 installed. If you only have Python 2 installed, the script should still work if you change python3 to python or python2 in the first line of the script.
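For illustration, a minimal sketch of a fastq-to-fasta conversion that writes each read on a single line. This is not the Shasta script mentioned above. It assumes the common four-line fastq layout (header, sequence, `+`, qualities), and the function name `fastq_to_fasta` is hypothetical.

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Convert 4-line fastq records to 2-line fasta records.

    Assumes each record is exactly four lines: '@' header, sequence,
    '+' separator, qualities. Returns the number of reads written.
    """
    count = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = fastq.readline()
        plus = fastq.readline()
        fastq.readline()  # quality line, not needed for fasta
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed fastq record near read {count + 1}")
        # Strip both Unix and Windows line ends, then write one line each.
        fasta.write(">" + header[1:].rstrip("\r\n") + "\n")
        fasta.write(sequence.rstrip("\r\n") + "\n")
        count += 1
    return count


# Made-up example with two reads.
source = io.StringIO("@read1 ch=1\nACGT\n+\n!!!!\n@read2\nGGCC\n+\n!!!!\n")
target = io.StringIO()
print(fastq_to_fasta(source, target))
print(target.getvalue(), end="")
```

With real files, the two arguments would be file objects opened for reading and writing respectively.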

biowackysci commented 5 years ago

Thanks for the prompt reply. I could work with the Python script. Sorry for misnaming the script; I meant to say Python script and wrote Perl instead. I got the python3 script working and converted the fastq to fasta, but the issue persists. This is what I got when I ran Shasta on the fasta file produced by the Python script:

Shasta Release 0.1.0
2019-Aug-15 11:45:36.920406
This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.

Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.

For more information about the Shasta assembler, see https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here: https://chanzuckerberg.github.io/shasta

Options in use:
Input FASTA files: /group/pasture/Saila/Shasta_assembly/latest_assembly_reads.fasta
outputDirectory = ShastaRun
memoryMode = anonymous
memoryBacking = 4K

[Reads]
minReadLength = 10000
palindromicReads.maxSkip = 100
palindromicReads.maxMarkerFrequency = 10
palindromicReads.alignedFractionThreshold = 0.1
palindromicReads.nearDiagonalFractionThreshold = 0.1
palindromicReads.deltaThreshold = 100

[Kmers]
k = 10
probability = 0.1

[MinHash]
m = 4
hashFraction = 0.01
minHashIterationCount = 10
maxBucketSize = 10
minFrequency = 2

[Align]
maxSkip = 30
maxMarkerFrequency = 10
minAlignedMarkerCount = 100
maxTrim = 30

[ReadGraph]
maxAlignmentCount = 6
minComponentSize = 100
maxChimericReadDistance = 2

[MarkerGraph]
minCoverage = 10
maxCoverage = 100
lowCoverageThreshold = 0
highCoverageThreshold = 256
maxDistance = 30
edgeMarkerSkipThreshold = 100
pruneIterationCount = 6
simplifyMaxLength = 10,100,1000

[Assembly]
markerGraphEdgeLengthThresholdForConsensus = 1000
consensusCaller = SimpleConsensusCaller
useMarginPhase = False
storeCoverageData = False

Shasta Release 0.1.0
2019-Aug-15 11:45:36.938486 Loading reads from /group/pasture/Saila/Shasta_assembly/latest_assembly_reads.fasta.
Input file block size: 2147483648 bytes.
Using 1 threads for reading and 48 threads for processing.
Input file size is 231113128975 bytes.
2019-Aug-15 11:45:37.434013 Reading block 0 2147483648, 2147483648 bytes.
Block read in 5.51606 s at 3.89315e+08 bytes/s.
Processing 2147473237 input characters.
A standard exception occurred in thread 19: Invalid base character 13
Aborted (core dumped)

Just thought I would let you know the outcome. Thanks again, S

paoloczi commented 5 years ago

Tomorrow I will give you a specially instrumented version of the executable to diagnose this problem. Could you create a shortened version of the fasta file (for example the first 100 lines), zip it, and attach it to this issue? You can just drag and drop the zip file here.

biowackysci commented 5 years ago

okay thanks. Here we go

test_100lines.fasta.zip

S

asdcid commented 5 years ago

Hi @paoloczi, I have fixed the no-called base problem with sed -e '/^[^>]/s/[^ATGCatgc]//g' $inputFile > $outputFile.

However, now I have a memory issue. How much memory do you think I should give Shasta if I want to assemble a genome (~500 Mb) with 70 GB of long reads (~150x)?

Found 265 strand jump regions.
Marked 4212 read graph edges out of 21301828 total as cross-strand.
2019-Aug-15 03:36:39.736160 End flagCrossStrandReadGraphEdges.
2019-Aug-15 03:36:39.755832 Begin flagging chimeric reads, max distance 2
Using 56 threads.
2019-Aug-15 03:36:39.755912 Processing 2367343 reads.
2019-Aug-15 03:36:44.942487 Done flagging chimeric reads.
2019-Aug-15 03:36:44.951849 Flagged 131227 reads as chimeric out of 2367343 total.
Chimera rate is 0.0554322
2019-Aug-15 03:36:44.976602 Computing connected components of the read graph.
The read graph has 693959 connected components.
2019-Aug-15 03:36:46.496735 Done computing connected components of the read graph.
Processing self-complementary component 0 with 3948676 oriented reads.
2019-Aug-15 03:36:47.272691 Begin computing marker graph vertices.
Using 56 threads.
Begin processing 21301828 alignments in the read graph.
2019-Aug-15 03:39:50.852484 Disjoint set computation begins.
2019-Aug-15 04:05:41.379123 Disjoint set computation completed.
2019-Aug-15 04:05:41.379201 Finding the disjoint set that each oriented marker was assigned to.
2019-Aug-15 04:05:59.499806 Terminated after catching a runtime error exception:
Error 12 during mremap call for MemoryMapped::Vector: Cannot allocate memory

paoloczi commented 5 years ago

I have not experimented with many different genome sizes so far, so it's hard to predict the memory requirements. For your situation I would expect something around 1 TB. If you are using Amazon AWS EC2 I suggest trying x1.16xlarge first (976 GB), and if that is not enough, x1.32xlarge (1952 GB).

paoloczi commented 5 years ago

Regarding @biowackysci's issue: the short test fasta file did not have any problems, so I will work on creating an instrumented version of the code to diagnose your problem.

However, I noticed that your reads are very short, at least compared to the results we reported in the paper. The N50 for the 50 reads in that file is under 4 Kb (versus 40 to 60 Kb for the data in our paper). Are these nanopore reads? I have not experimented with reads this short, and working with them will require changing some assembly parameters (at the very least a reduction in the read length cutoff --Reads.minReadLength, which is 10 Kb by default; otherwise you will lose most of your coverage). When the input issue is fixed, please file a separate issue to talk about assembly parameter changes for short reads (although "short reads" usually means something else, for our purposes 4 Kb is short).
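To make the N50 and read length cutoff remarks concrete, here is a minimal sketch that computes the N50 of a set of read lengths and the fraction of bases that survive a minimum read length. The function names and the example lengths are made up for illustration. The 10000 default mirrors the minReadLength value shown in the logs above.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    contain at least half of all bases. Returns 0 for an empty input."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def retained_fraction(lengths, min_read_length=10000):
    """Return the fraction of bases in reads at least min_read_length long."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    kept = sum(length for length in lengths if length >= min_read_length)
    return kept / total


# Made-up read lengths, in bases.
read_lengths = [2000, 3000, 4000, 5000, 6000]
print(n50(read_lengths))
print(retained_fraction(read_lengths))
```

If most reads fall below the cutoff, the retained fraction is small, which is the loss of coverage the comment above warns about.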

paoloczi commented 5 years ago

@biowackysci, I have created an instrumented executable that you can use to diagnose your problem. You can download it from here. This is a temporary link that will expire in a few days.

After downloading it, please make it executable, then run it using option --input to specify your fasta file. Please save the entire output (stdout) and attach it here (you may have to zip it depending on how you choose to attach it).

biowackysci commented 5 years ago

Thanks for the link. In the previous run I had added some PacBio reads as well. I am now trying to run it again using only Oxford Nanopore reads (PromethION and MinION) to see how it goes.

biowackysci commented 5 years ago

I tried running the executable from the link. This is the output I got: shasta_instrumented_output.txt

It stops after this error:
2019-Aug-16 10:56:34.082833 Reading /group/pasture/Saila/Shasta_assembly/latest_assembly_reads_nocall_solved_test.fasta block 0 2147483648, 2147483648 bytes.
2019-Aug-16 10:56:35.470707 Terminated after catching a runtime error exception: Expected '>' at beginning of a block.

Not sure what is happening

paoloczi commented 5 years ago

This is a different error now. Can you check if the very first character of the file is ">"? There should be nothing in the file before the first read.
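A minimal sketch of that check, reading only the first byte of the file. The function name `starts_with_fasta_header` and the temporary file contents are made up for illustration.

```python
import os
import tempfile


def starts_with_fasta_header(path):
    """Return True if the very first byte of the file is '>'."""
    with open(path, "rb") as handle:
        return handle.read(1) == b">"


# Demonstration on two small temporary files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.fasta")
    bad = os.path.join(tmp, "bad.fasta")
    with open(good, "w") as handle:
        handle.write(">read1\nACGT\n")
    with open(bad, "w") as handle:
        handle.write("ACGT\n>read1\nACGT\n")
    print(starts_with_fasta_header(good), starts_with_fasta_header(bad))
```

Reading a single byte keeps the check cheap even for input files of several hundred gigabytes.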

biowackysci commented 5 years ago

I ran head on the file and this is what it contains. No, I don't see a ">":

accacaaacacactattT TGTTTTCAGTACGTGCTTCGTTCGAGATCATTTAGGTGTTTAACCGTTTCGCGTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCCCTTGATGGAAACAATAGCATGTTCAGATCTGAGATCATGGCACTTTCGGGCCCTGGTAGTGACAAACATTAAGCGCAACAAGTTTGCAACAATATCATCAAAGTACCAATTACGGACACTAGGCTATGCCCTAACAATCTTAATATATTACATCACCAATCTCATCCAATCCCTACCATCCCTTGACTCTGGTGAAATTACTCACACATGGATGGGAAACATGGTAGTTGATGTAGAAGTCCAGCCCGGTGATAAGCGGCGATGAAGTCCTCAATTC

acaaacacactattT TACGTTTTCGGTATTGCCGTTCAGTTCGATTTCCGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTTCGCGTTTCGTGCGCCGCTTCAACTACAAAGATAATTCCGTTAATGACTCTTTAGTACGCCCTCTTTTTTTTATAGCATAGTTCTTCCTTGTTTACACACCTTCCTTGTTTCACCGAGAGACTAAAATTAAGGAATGGATAGAAAACCGGTAGAGACCGAAATCTTGTAGATTTGGTGGGAGACAAATGGTCAGATAAAAAAAATATGGAATGAAACCTCCAGCCGGAGAAAGCGAACAGTTCCTCAGCACGCGCGTAGAGATTTATTTTTTGCGAAAAGGCCAAAGGGCATACCTCCCCATCATGAAGCGTACTTTTAACATTCTTGAAAAGTCCAACAAAATTTAGGTTTATAAAAAAAGCACCAACATTTATGTCACCAAATTAATATCATTATATTCATGATAAAATAAATTTTCATCATAAGCTATTTGATTTATAAAATATTTTTTTGTGTAAACCATTAAAATTTTAAAATTGTTTTGATTTTCTAGAAAAGTTAAAAGTCTTTCATGGAGCAGAGAGTATAGTAAGTCAACCAACCCAAATTTTCAAATTACCACGATGACTCATCTCAAGGCCATATTATGCGTATGTATTGTGCTTCCACGAATCCAATGACCAACAATTTACCACCTCCCACTTATCTTGTTTGACTATGAAAACGAAAAAAATACGTAACAATAAAATATGGAATCATTGAAGATAGACCCAGCCATGCTCCACAATAAAAAAATACAGAACAGCAGAAGGTCTAAATGTTTTTACACTATTGTTTTGGCAATTGGTTGAGTTCATATAGTCTATCTGCA

ccaacaacacactattT TTGGTATGCTTCGTTCAGTTCAGAAGGTGGGTGTTTAGCGAACGCATTTATCGTTGACATTGCGCCGCTTCAGCTAGTCCTAAGCAGAGTGTTAGGATCAACCTCGTGTTGGTTTTAGGCCTTGTTTAGGATGGAAACTACGATCACCGTGTGTGGCAGCGAGCTATTCGTGAAGTAGGATGATCCGATTATGCGTTGAAAGAAACCCCAAATGGTAAATAGATCGAGTTAGCTTTATCTTGATCAAGCGGGACCACCATATATTAATTTTTTTATACGTCTGAGATCATGGAGTCGGATCGGCTCCTTGGCCCGATTCAGGACAACACGAGAAGCCGTCAGGGCTGCATTATTTATGTTCACGTGTATGCCATGCAGAAACTAGCGAGCATCTCCATCACCTTCTGACGGTATAGGTCGGGTAGCATATACATCCCGACATCCGGGCGATGAAATGCGAAACTGCGGACCGTGGCTTTCGGAACCGGGGAATATTGCGTGACCCTAAGATTGTTCCCGACTCCCTCTTTGTATTTGCGGTAGCTGCCGCCGGTGGGTTTAAAGCGCCAACACGATGCGGCGCGTCGGTCAGGACAATCTTCCCACATCAACCACATGCGCCATCTACATCTGGGGTAGCGAAGCGTCATGAGTCACCTCTGAGGATCGTAGAGGCGAGCTCAGAGAAGAAGTATGGCGAGGTCAAAGCTATCCTCGAGGCGATCTCGCCGGCTCTTTTCCTGAACTTTCCCGTTCACATGTATGATTGGGAAGGGGAT

I am not sure where it is finding a ">".

paoloczi commented 5 years ago

So Shasta is correctly complaining that the file does not begin with a ">".

How did you create this file?

biowackysci commented 5 years ago

I used a sed command to convert my fastq files to fasta files: sed -e '/^[^>]/s/[^ATGCatgc]//g'

biowackysci commented 5 years ago

I also tried using an input file which looked like this:

4ea0f3ed-41c5-4c66-96a1-ecd1eda89741 runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 read=8 ch=241 start_time=2018-06-28T04:40:05Z TGTTTTCAGTACGTGCTTCGTTCGAGATCATTTAGGTGTTTAACCGTTTCGCGTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCCCTTGATGGAAACAATAGCATGTTCAGATCTGAGATCATGGCACTTTCGGGCCCTGGTAGTGACAAACATTAAGCGCAACAAGTTTGCAACAATATCATCAAAGTACCAATTACGGACACTAGGCTATGCCCTAACAATCTTAATATATTACATCACCAATCTCATCCAATCCCTACCATCCCTTGACTCTGGTGAAATTACTCACACATGGATGGGAAACATGGTAGTTGATGTAGAAGTCCAGCCCGGTGATAAGCGGCGATGAAGTCCTCAATTC ae2740b1-ef0d-46d6-b190-9ff6377ca8f6 runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 read=20 ch=335 start_time=2018-06-28T04:40:06Z TACGTTTTCGGTATTGCCGTTCAGTTCGATTTCCGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTTCGCGTTTCGTGCGCCGCTTCAACTACAAAGATAATTCCGTTAATGACTCTTTAGTACGCCCTCTTTTTTTTATAGCATAGTTCTTCCTTGTTTACACACCTTCCTTGTTTCACCGAGAGACTAAAATTAAGGAATGGATAGAAAACCGGTAGAGACCGAAATCTTGTAGATTTGGTGGGAGACAAATGGTCAGATAAAAAAAATATGGAATGAAACCTCCAGCCGGAGAAAGCGAACAGTTCCTCAGCACGCGCGTAGAGATTTATTTTTTGCGAAAAGGCCAAAGGGCATACCTCCCCATCATGAAGCGTACTTTTAACATTCTTGAAAAGTCCAACAAAATTTAGGTTTATAAAAAAAGCACCAACATTTATGTCACCAAATTAATATCATTATATTCATGATAAAATAAATTTTCATCATAAGCTATTTGATTTATAAAATATTTTTTTGTGTAAACCATTAAAATTTTAAAATTGTTTTGATTTTCTAGAAAAGTTAAAAGTCTTTCATGGAGCAGAGAGTATAGTAAGTCAACCAACCCAAATTTTCAAATTACCACGATGACTCATCTCAAGGCCATATTATGCGTATGTATTGTGCTTCCACGAATCCAATGACCAACAATTTACCACCTCCCACTTATCTTGTTTGACTATGAAAACGAAAAAAATACGTAACAATAAAATATGGAATCATTGAAGATAGACCCAGCCATGCTCCACAATAAAAAAATACAGAACAGCAGAAGGTCTAAATGTTTTTACACTATTGTTTTGGCAATTGGTTGAGTTCATATAGTCTATCTGCA 26dec5c5-a50d-4257-90f8-b83785845a9c runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 read=44 ch=10 start_time=2018-06-28T04:40:07Z 
TTGGTATGCTTCGTTCAGTTCAGAAGGTGGGTGTTTAGCGAACGCATTTATCGTTGACATTGCGCCGCTTCAGCTAGTCCTAAGCAGAGTGTTAGGATCAACCTCGTGTTGGTTTTAGGCCTTGTTTAGGATGGAAACTACGATCACCGTGTGTGGCAGCGAGCTATTCGTGAAGTAGGATGATCCGATTATGCGTTGAAAGAAACCCCAAATGGTAAATAGATCGAGTTAGCTTTATCTTGATCAAGCGGGACCACCATATATTAATTTTTTTATACGTCTGAGATCATGGAGTCGGATCGGCTCCTTGGCCCGATTCAGGACAACACGAGAAGCCGTCAGGGCTGCATTATTTATGTTCACGTGTATGCCATGCAGAAACTAGCGAGCATCTCCATCACCTTCTGACGGTATAGGTCGGGTAGCATATACATCCCGACATCCGGGCGATGAAATGCGAAACTGCGGACCGTGGCTTTCGGAACCGGGGAATATTGCGTGACCCTAAGATTGTTCCCGACTCCCTCTTTGTATTTGCGGTAGCTGCCGCCGGTGGGTTTAAAGCGCCAACACGATGCGGCGCGTCGGTCAGGACAATCTTCCCACATCAACCACATGCGCCATCTACATCTGGGGTAGCGAAGCGTCATGAGTCACCTCTGAGGATCGTAGAGGCGAGCTCAGAGAAGAAGTATGGCGAGGTCAAAGCTATCCTCGAGGCGATCTCGCCGGCTCTTTTCCTGAACTTTCCCGTTCACATGTATGATTGGGAAGGGGAT fced0264-6eee-4d32-9906-62e579e6baa5 runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 read=27 ch=145 start_time=2018-06-28T04:40:06Z TCGTTTTTTTTTTTTTCGGTAGCCGCTACGTTTCATTTCGGTGGGTGTTTAGCCGTTTCAAAACATCATTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCCCAATGAGTAGCATATGCGAGGTAAAGGGATTTAGGTGATGCATAGTACTCCTTGTATTCATACGGCCTCTGCTGACATTATTACGGCACATACCAAGGCAATTTGATTTCAAATTGATTTGCTTTCACAATGTTTTTCAGGATCCTCTCTTTCCTGTGAAGCTCCAACACCGCCCATTAAAAGCGGATCCTCCAACTTTAAGAATGCATACCTGGTTGAATGGTGTAATGGTAAGGCCAATAGCGGTGCAGAGTGTATGGCGGCGGCAATATGCGCCCTTGCAAGTATTACCAAGCCTGGGGGAAGCTGGCTTTGTTTCCCTTTATGACTAGCGAAGAATTCTTTTGATGCGGTTCTATAGGGTACATCCCATCCCTCTTTATAGGACCTCTTAAAAGTGCCTCCCCATCGGGTAGATGAACAGCCAAAATGCACCATCGCTCCAAAGAGCGGAGGATGTAGCTCCGGCTTGCGGGATAAGCGCTGTAATGTCACCTTCAAGAGCGTTTTACTTGGCCATTTCTGCTTGAGAGTTTTTGCTATTTCCCTAAAAGAACTTAATTCCCAACTCTTGCCGAAATCCGTTGTATTTGGCTTCACTAAATGCTGCAATACCCAATAGCTAGCTGCGGCCCAGGATGTGACAATCATTGAGTTTTTCAATCCCAGCACGTTTATTTCAGCCGGCGTGAGACAATTTTGCAATGTTGGAAGCAAATCGAGGCGAGGGAACATCACATAAGATTTTCATAAAATAATAATGAAAATGTTCTTTTTGAAGGCGCCGACGTGTACC 2207abb6-b02f-4939-805a-f33c28cb2e50 runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 read=15 ch=85 start_time=2018-06-28T04:40:06Z 
ATCGGTATTACTTCGTTTCAGTTCAGGTGGGTGTTTTAACCGTTTTCGCATTTATCGTGGAAACGCTTTCGCGTTTTCGTGCCCGCCGCTTCCGCCGGGAGAACAGCGGCCTCGGCCCGAATCGGCAACCGATAGGCGAAGAACCGACGCTGCTCCGCTTCTCATCTGCTTGCTGCGGGTGTAAATTAGGTGCTGGCCTTGCTTGGGCCAGGGTGGGGACGGCAGGGATTGGCTGCACTCGGGCCTGTGGGTTAGAGGCTGCTGCGGGGTCGCCAAGCCATTCAGGACTCTGCAGGATGTTTCAGCTGCCGGGGCTGCGTCATGCATGCGGTGAGAGGAAGATGGGGATCGGCCCATGGGACCTTAGGCTTGGTGAGTGGGCCGTGCCCCTAAGTGGCTTATTCCCCGGTGGCCCGGCGATGAGTCGATGGTGGAGTCCTACCTATCCACGGCGAAGACTGAAGGGACGTCAGCGTTTTTAGTTTTTTCACGCAAAGTCGTATGCAAAATCCCGGAGGGCTGGTTTTTTTATTTCTTGTTTCTTGGGAGGCCCTCTGTAGCGGTTTGATGCATAGCAAATTAATGGGAATAATGCAGCTCTGGTGCCGGGACCTACAATAAAAAAAAAGGAAGCATTTAGCGGCGGGCTGTGAATCACTTCTACGTTTACTTTCCAGATGATGCAGCGTTCCACGACCGGAGAACCTCTCTTCTAACAAGCCTGTGCGCCGGCCCGGTTTTAATCCCGCCCGTGTATTTATTTAGTAAATCATTAGCCCTGGTAATCCCCGTCTAAGCGTTCCGCGATCGAACTCTCTCTCTGATAGCAGACCTATTGCTTCCGCCCGGTATTCAGTCCACCATGTATTTCGTCAAATGGTAGCCCTGGTGAAATACGTTAATAAAGTACGTTCCGCAGCGAACTACTCTCTCCCTACCCAGGCCTGGCCGTACCGCCATCCAGCTTGGTCCATACGGTGTATTGCGTGCGTGGCCCCTGGTAATGCGGCGTCTGGTGTTCACGGCCGAACGCGACTCTACTTGAACCCTGACCATTGCCGCCCGGTGGTCTTTAGTCCCCATGTATTTGTCAAATGCGTGGCCCTGGAATTAAAGCATACGGCATTCCGACCGAGACCTACTTGGGCCTGTGCATTATCATGCTGTGGTTTAGGTCCACGTGTATTTGAGGTAAAATGCATTGGCCACGGTGATGCCCAGTCACAAAGCGTTCGACCGAACTTCTCTCACGGCGAGGGCTATTTGTACTTGCCGGCGTTCAATCAGCCAATTCCCAGTCCGTCGTGGCCTGGTAGTACTTAGCGTCTAGCGTTCACGA

the output file is shasta_instrumented_my_fastafile_output.txt

This looks like the previous issue you had mentioned. If I can sort this fasta file maybe I can get somewhere ahead

S

asdcid commented 5 years ago

@biowackysci I don't think sed -e '/^[^>]/s/[^ATGCatgc]//g' <input_file> <output_file> can convert fastq to fasta. What this command does is remove every character other than A, T, G, C (upper or lower case) from each line that does not start with >.
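To illustrate what that sed expression does, here is a minimal Python sketch of the same per-line substitution. The function name `strip_non_bases` is hypothetical. Applied to the first read header posted earlier in this thread, which does not start with ">", it yields the string accacaaacacactattT seen in the head output above.

```python
import re

# Same character class as the sed expression: anything that is not
# one of A, T, G, C in either case.
NON_BASE = re.compile(r"[^ATGCatgc]")


def strip_non_bases(line):
    """Mirror sed -e '/^[^>]/s/[^ATGCatgc]//g' for one line.

    The line is expected without its trailing newline, as sed sees it.
    Lines starting with '>' and empty lines are returned unchanged.
    """
    if line and not line.startswith(">"):
        return NON_BASE.sub("", line)
    return line


header = (
    "4ea0f3ed-41c5-4c66-96a1-ecd1eda89741 "
    "runid=91a294d4b29a6bd220cfe4d26b5b3364bab5c2d5 "
    "read=8 ch=241 start_time=2018-06-28T04:40:05Z"
)
print(strip_non_bases(header))         # header line is mangled, not converted
print(strip_non_bases(">read1 ch=2"))  # a real fasta header is left alone
print(strip_non_bases("ACNNGT"))       # no-called bases are deleted in place
```

The command therefore edits sequence lines and damages any header that lacks a leading ">", but it never adds the ">" that a fasta header needs.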

biowackysci commented 5 years ago

@asdcid thanks for that. I also used the fasta file I had used previously and got the error posted above.

paoloczi commented 5 years ago

Ok, I am a bit confused. Let me see if I got it right. It seems that you have two separate problems:

  1. The file that does not have > as its first character. Shasta is correctly rejecting that file as it is invalid. I will make sure to have a better error message for that case in the next release.
  2. The file that does have > as its first character, and for which you sent me the first 100 lines the other day. Did you run the instrumented executable on this file? If so, please post the entire output. I only saw output from the instrumented executable for the file that does not have > as its first character.

Please confirm that I have it right.

For the future, I would generally prefer to use a separate GitHub issue for each distinct problem, and this issue actually was initially about reads containing no-called bases. But let's keep it the way it is and hopefully we can sort it all out.

biowackysci commented 5 years ago

Hello paoloczi, thanks for the reply. This is the fasta file:

head_output.txt

And when I used the link I got this output shasta_instrumented_my_shasta_assembly_readsfastafile_output.txt

paoloczi commented 5 years ago

The message shows a Windows-style line end: ASCII 13 followed by ASCII 10, corresponding to a Carriage Return/Line Feed sequence (CR/LF). You need to convert the file to proper Unix-style line ends. You can use the Linux command dos2unix for that.

This occurred at byte offset 2013261637 in the file while processing input read with id 9d2e4f61-ec71-4eb6-8eca-965522d9c85a. This is not necessarily the first read where the error occurred because the reads are processed in an unpredictable order due to multithreading.

I will add to the wish list for improvements in the loading of reads the ability to also read files containing Windows-style line ends.
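If dos2unix is not available, the same conversion can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for dos2unix. The function name `convert_crlf` is hypothetical, and the sketch assumes each line (one read) fits in memory.

```python
import io


def convert_crlf(src, dst):
    """Copy a binary stream, turning CR/LF line ends into LF.

    Reads one line at a time, so only a single line is held in memory.
    Returns the number of lines whose ending was changed.
    """
    changed = 0
    for line in src:
        if line.endswith(b"\r\n"):
            line = line[:-2] + b"\n"
            changed += 1
        dst.write(line)
    return changed


# ASCII 13 is carriage return and ASCII 10 is line feed.
print(ord("\r"), ord("\n"))

# Made-up example: the first read has Windows line ends, the second does not.
source = io.BytesIO(b">read1\r\nACGT\r\n>read2\nGGCC\n")
target = io.BytesIO()
print(convert_crlf(source, target))
print(target.getvalue())
```

With real files, both streams would be opened in binary mode ("rb" and "wb") so that Python does not translate line ends itself.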

paoloczi commented 5 years ago

So, to summarize this issue, two topics were covered, resulting in two desired enhancements:

biowackysci commented 5 years ago

Thanks again. Okay, now I have sorted out the ASCII issue and was able to move forward to the next error. I ran the script again and this is the output: output_awk_test.txt

the error reads "2019-Aug-19 15:21:22.922939 A runtime error occurred in thread 36: The sequence of each read must be on a single line of the input fasta file. Read c9c2c4a5-13e0-45fd-be61-2c9a32ac4c98 spans more than one line near file offset 83216460077"

So I guess there is some error with this specific read?

Thanks for leading me till here. S

paoloczi commented 5 years ago

As I explained in one of my comments on this issue above, Shasta has a documented restriction that the base sequence of each read in an input Fasta file must be on a single line. Reads whose sequence spans more than one line are not allowed. You will need to preprocess the file to remove all line ends embedded in the sequence of each read.

Alternatively, if your reads come from a Fastq file, you can process that file using shasta/scripts/FastqToFasta.py. That script will create a Fasta file with the required format.

Hopefully these restrictions will go away at some point, but for now we have to live with them.
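A minimal sketch of the preprocessing step described above, joining the wrapped sequence lines of each read into one line. The function name `unwrap_fasta` is hypothetical and the example data is made up.

```python
def unwrap_fasta(lines):
    """Yield fasta lines with each read's sequence joined onto one line.

    Header lines (starting with '>') are passed through. Consecutive
    sequence lines between two headers are concatenated. Blank lines
    are dropped, and both Unix and Windows line ends are stripped.
    """
    pieces = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if pieces:
                yield "".join(pieces)
                pieces = []
            yield line
        elif line:
            pieces.append(line)
    if pieces:
        yield "".join(pieces)


# Made-up example: read1 is wrapped over two lines, read2 is not.
wrapped = [">read1\n", "ACGT\n", "TTAA\n", ">read2\n", "GGCC\n"]
for line in unwrap_fasta(wrapped):
    print(line)
```

Because the function consumes any iterable of lines, an open file object can be passed directly and the output written line by line, which avoids loading a large fasta file into memory.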

paoloczi commented 4 years ago

The latest code supports multiline reads, reads containing invalid bases (those reads are discarded), and Windows line ends.