Aborted quorum_error_correct_reads

alekseyzimin / masurca

GNU General Public License v3.0

Aborted quorum_error_correct_reads #33

Open BenjaminGuinet opened 6 years ago

BenjaminGuinet commented 6 years ago

Hi, I'm currently using MaSuRCA-3.2.6 to assemble my genome, and I ran the following script:

    #PBS -S /bin/bash
    #PBS -l nodes=1:ppn=8:bigmem,mem=100gb
    #PBS -e /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.error
    #PBS -o /pandata/ACG-0006_0027/LOGS/ACG-006_assembly.out
    #PBS -N ACG-006
    #PBS -q q1week

    DATA
    PE= pe 150 22 /pandata/LEPIWASP/ACG-0006_0027/frag_1.fastq /pandata/LEPIWASP/ACG-0006_0027/frag_2.fastq

    END

    PARAMETERS
    #set this to 1 if your Illumina jumping library reads are shorter than 100bp
    EXTEND_JUMP_READS=0
    #this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
    GRAPH_KMER_SIZE = auto
    #set this to 1 for all Illumina-only assemblies
    #set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
    #otherwise keep at 0
    USE_LINKING_MATES = 0
    #specifies whether to run mega-reads correction on the grid
    USE_GRID=0
    #specifies queue to use when running on the grid MANDATORY
    GRID_QUEUE=all.q
    #batch size in the amount of long read sequence for each batch on the grid
    GRID_BATCH_SIZE=300000000
    #coverage by the longest Long reads to use
    LHE_COVERAGE=30
    #this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms 
    LIMIT_JUMP_COVERAGE = 300
    #these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically. 
    #set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
    CA_PARAMETERS =  cgwErrorRate=0.15
    #minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
    KMER_COUNT_THRESHOLD = 1
    #whether to attempt to close gaps in scaffolds with Illumina data
    CLOSE_GAPS=1
    #auto-detected number of cpus to use
    NUM_THREADS = 16
    #this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
    JF_SIZE = 200000000
    #set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
    SOAP_ASSEMBLY=0
    END

This produced the assemble.sh file, which I then ran, and I got the following .out:

    [Sat Jun 16 22:32:45 CEST 2018] Processing pe library reads
    [Sat Jun 16 22:49:04 CEST 2018] Average PE read length 150
    [Sat Jun 16 22:49:05 CEST 2018] Using kmer size of 49 for the graph
    [Sat Jun 16 22:49:06 CEST 2018] MIN_Q_CHAR: 33
    WARNING: JF_SIZE set too low, increasing JF_SIZE to at least 1115876884, this automatic increase may be not enough!
    [Sat Jun 16 22:49:06 CEST 2018] Creating mer database for Quorum
    [Sat Jun 16 23:09:23 CEST 2018] Error correct PE.
    [Sat Jun 16 23:11:49 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.

and the .error:

    /panhome/TOOLS/MaSuRCA-3.2.6/assemble.sh: line 102: 46750 Aborted                 quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)) --contaminant=/panhome/TOOLS/MaSuRCA-3.2.6/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 16 -w 10 -e 3 -M quorum_mer_db.jf pe.renamed.fastq --no-discard -o pe.cor.tmp --verbose > quorum.err 2>&1

Does someone have an idea of what is going on here? Thanks for your help.

The two FASTQ files come from an Illumina HiSeq 3000 run (150 bp reads), and the genome size of my species is around 1.5 Gb.

BenjaminGuinet commented 6 years ago

I tried changing JF_SIZE to 25500000000 and got this error:

    line 102: 25712 Aborted                 quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)) --contaminant=/panhome/bguinet/TOOLS/MaSuRCA-3.2.6/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 16 -w 10 -e 3 -M quorum_mer_db.jf pe.renamed.fastq --no-discard -o pe.cor.tmp --verbose > quorum.err 2>&1

and the .out:

    [Sun Jun 17 11:40:30 CEST 2018] Processing pe library reads
    [Sun Jun 17 11:50:47 CEST 2018] Average PE read length 150
    [Sun Jun 17 11:50:47 CEST 2018] Using kmer size of 49 for the graph
    [Sun Jun 17 11:50:48 CEST 2018] MIN_Q_CHAR: 33
    [Sun Jun 17 11:50:48 CEST 2018] Creating mer database for Quorum
    [Sun Jun 17 12:19:01 CEST 2018] Error correct PE.
    [Sun Jun 17 12:35:01 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.

The frag_1.fastq and frag_2.fastq files look correct:

    /pandata/LEPIWASP/ACG-0006_0027$ file -b -i frag_1.fastq
    text/plain; charset=us-ascii
    /pandata/LEPIWASP/ACG-0006_0027$ file -b -i frag_2.fastq
    text/plain; charset=us-ascii

I also cannot check the pe.cor.log file because it does not exist.

alekseyzimin commented 6 years ago

How much coverage do you have? If you have more than 100x Illumina coverage, just use the first 100x and discard the rest. Too much coverage is the main reason for such failures.
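For anyone unsure how to answer the coverage question: a rough estimate is the total number of sequenced bases divided by the estimated genome size. Below is a minimal Python sketch of that arithmetic. It assumes uncompressed FASTQ with exactly four lines per record; the function names `count_bases` and `estimate_coverage` are illustrative and not part of MaSuRCA.

```python
from itertools import islice


def count_bases(lines):
    """Sum sequence lengths from an iterable of FASTQ lines (4 lines per record)."""
    total = 0
    # The sequence is the 2nd line of each record: indices 1, 5, 9, ...
    for seq in islice(lines, 1, None, 4):
        total += len(seq.strip())
    return total


def estimate_coverage(line_sources, genome_size):
    """Total bases across all inputs divided by the estimated genome size."""
    return sum(count_bases(src) for src in line_sources) / genome_size


# Hypothetical usage with open file handles (any iterable of lines works):
#   with open("frag_1.fastq") as r1, open("frag_2.fastq") as r2:
#       print(estimate_coverage([r1, r2], 1_500_000_000))
```

The genome size in the usage comment is the roughly 1.5-gigabase figure given in the first post; substitute your own estimate.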

amarquard commented 5 years ago

Hi Aleksey, I have a similar problem. Do you mean that we should use the first 100x reads and discard the rest - is there no way of using all of my reads, now that I've spent money on sequencing it deeper? I think I have somewhere <700x.

alekseyzimin commented 5 years ago

Yes, Illumina sequencing deeper than 100x for a haploid genome, or deeper than 150x for a diploid genome, is detrimental to assembly quality. Unfortunately, there is no way to use the extra data in a way that will benefit the assembly, as it will introduce errors and spurious duplications into the sequence.

You should just use the first ~100x for haploid or inbred genomes and ~150x for heterozygous genomes.

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 http://www.genome.umd.edu http://masurca.blogspot.com
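One way to act on the "use the first ~100x" advice is to copy read pairs in file order until the base budget (genome size times target coverage) is used up. The following is a rough Python sketch, not a MaSuRCA feature. It assumes uncompressed four-line FASTQ and that both mate files list reads in the same order; all function names here are hypothetical.

```python
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input or a truncated trailing record
            return
        yield record


def first_pairs_up_to(r1_lines, r2_lines, genome_size, target_coverage):
    """Yield read pairs in file order until the base budget is reached."""
    budget = genome_size * target_coverage
    used = 0
    for rec1, rec2 in zip(read_records(r1_lines), read_records(r2_lines)):
        if used >= budget:
            break
        used += len(rec1[1].strip()) + len(rec2[1].strip())
        yield rec1, rec2


def subsample_files(r1_in, r2_in, r1_out, r2_out, genome_size, target_coverage):
    """Write the first pairs covering roughly target_coverage to new files."""
    with open(r1_in) as a, open(r2_in) as b, \
            open(r1_out, "w") as out1, open(r2_out, "w") as out2:
        for rec1, rec2 in first_pairs_up_to(a, b, genome_size, target_coverage):
            out1.writelines(rec1)
            out2.writelines(rec2)
```

Keeping mates together matters because the PE line in the MaSuRCA config expects the two files to stay in sync; the output may exceed the budget by at most one pair.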

amarquard commented 5 years ago

Hi Aleksey, Thanks for taking the time. I am actually only using the superreads module of masurca, to use for alignment with HISAT2. Do you think this is still why it’s crashing then, or just out of memory? I was unsure how much memory to allocate when just using masurca-superreads. Do the same guidelines apply as for the masurca assembler? Best, Andrea

alekseyzimin commented 5 years ago

Excessive coverage may cause quorum to crash. I would not trust corrected reads or super-reads produced from more than 100x coverage: the number of super-reads will be too large, and many of them will differ by a single base that is a sequencing error, because of error biases in Illumina reads. With too much coverage, multiple reads will carry the same error at the same base and thereby confirm each other. This is the biggest problem with using too much coverage.

amarquard commented 5 years ago

Thanks again for your valuable comments. I see now. Do you have a recommendation for a certain way to downsample (for example a way that tends to discard low quality reads rather than high quality, or something along those lines….) or do you recommend just randomly selecting a subset of reads to use? (In case it’s relevant to my question, I am working with human RNAseq data intended for analysis of alternative splicing)

Best, Andrea

alekseyzimin commented 5 years ago

Sorry, I did not realize that you are assembling RNA-seq data. For RNA-seq data I recommend using the SuperReads_RNA package from GitHub: https://github.com/alekseyzimin/SuperReads_RNA. Usage is similar to masurca; the main executable is just named createSuperReads_RNA instead of masurca.

My earlier recommendation of 100x coverage applies only to de novo genome assembly. For RNA-seq data the coverage is so variable that downsampling may lead to loss of rare splice variants, and therefore you have to use higher coverage. Still, 700x is a bit too much, so try taking half of the data. You can select the reads with the highest overall quality.
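"Highest overall quality" can be read in more than one way. One simple interpretation is to rank read pairs by the mean Phred score across both mates and keep the top half. The Python sketch below does that under several assumptions: paired, uncompressed four-line FASTQ, Phred+33 quality encoding, and a data set small enough to hold in memory (a two-pass version would be needed for large files). The function names are illustrative and not part of SuperReads_RNA or MaSuRCA.

```python
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def mean_quality(quality_line, offset=33):
    """Arithmetic mean Phred score of one quality string (Phred+33 assumed)."""
    q = quality_line.strip()
    return sum(ord(c) - offset for c in q) / len(q) if q else 0.0


def pair_quality(rec1, rec2, offset=33):
    """Mean Phred score over all bases of both mates."""
    return mean_quality(rec1[3].strip() + rec2[3].strip(), offset)


def keep_best_pairs(pairs, fraction=0.5, offset=33):
    """Return the top `fraction` of pairs by mean quality, in original order."""
    pairs = list(pairs)
    n_keep = int(len(pairs) * fraction)
    ranked = sorted(
        range(len(pairs)),
        key=lambda i: pair_quality(pairs[i][0], pairs[i][1], offset),
        reverse=True,
    )
    keep = set(ranked[:n_keep])
    return [pairs[i] for i in sorted(keep)]


# Hypothetical usage:
#   with open("reads_1.fastq") as a, open("reads_2.fastq") as b:
#       best = keep_best_pairs(zip(read_records(a), read_records(b)))
```

Ranking whole pairs rather than individual reads keeps the two mate files in sync after filtering.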

amarquard commented 5 years ago

Thank you for your recommendations. I was following the steps from your lab described here: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual (which mentions masurca). I will use SuperReads_RNA instead then, but is there anywhere I can read a bit more about the tool and what it does? (Perhaps a publication? There is not a lot of background on the GitHub page…)

alekseyzimin commented 5 years ago

This tool is basically a subset of the masurca package with modifications for RNA-seq. It will be part of the StringTie2 publication. It computes super-reads for RNA-seq data.

amarquard commented 5 years ago

Thanks a lot! You have been very helpful

On 5 Feb 2019, at 21.55, Aleksey Zimin notifications@github.com wrote:

This tool is basically a subset of the masurca package with modifications for RNAseq. It will be part of the stringtie2 publication. It computes super-reads for RNAseq data.

On Tue, Feb 5, 2019, 3:44 PM amarquard notifications@github.com wrote:

Thank you for your recommendations. I was following the steps mentioned here from your lab: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual (which mentions masurca). I will use SuperReads_RNA instead then, but is there anywhere I can read a bit more about the tool and what it does? (Perhaps a publication? There is not a lot of background on the GitHub page…)

On 5 Feb 2019, at 16.58, Aleksey Zimin notifications@github.com wrote:

Sorry, I did not get that you are assembling RNAseq. For RNAseq data I recommend using the SuperReads_RNA package from GitHub: https://github.com/alekseyzimin/SuperReads_RNA. Usage is similar to masurca, just the main exec is named createSuperReads_RNA instead of masurca.

My earlier recommendation of 100x coverage only applies to de novo genome assembly. For RNAseq data the coverage is so variable that downsampling may lead to loss of rare splice variants, and therefore you have to use higher coverage. Still, 700x is a bit too much, so try taking half of the data. You can select the reads with the highest overall quality.
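One possible way to "take half of the data" by quality, as a minimal sketch rather than anything SuperReads_RNA provides: the function names here are hypothetical, and the code assumes four-line FASTQ records, Phred+33 quality encoding, and mate files listing pairs in the same order. Each pair is scored by the mean quality over both mates, and the best-scoring fraction is kept in its original order.

```python
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a line iterator.

    Assumes exactly four lines per record (no wrapped sequences).
    `handle` must be an iterator, e.g. an open file object.
    """
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if len(record) < 4:
            return
        yield record

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)

def select_best_pairs(records_1, records_2, fraction=0.5):
    """Keep the `fraction` of read pairs with the highest mean quality.

    Pairs are scored over both mates together and returned in input order.
    All pairs are held in memory, so this only suits modest inputs.
    """
    pairs = list(zip(records_1, records_2))
    keep = int(len(pairs) * fraction)
    scores = [mean_phred(r1[3] + r2[3]) for r1, r2 in pairs]
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in sorted(ranked[:keep])]
```

For files too large to hold in memory, the same idea would need two passes (one to find the score cutoff, one to write the selected pairs).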

On Tue, Feb 5, 2019 at 10:39 AM amarquard notifications@github.com wrote:

Thanks again for your valuable comments. I see now. Do you have a recommendation for a certain way to downsample (for example a way that tends to discard low quality reads rather than high quality, or something along those lines….) or do you recommend just randomly selecting a subset of reads to use? (In case it’s relevant to my question, I am working with human RNAseq data intended for analysis of alternative splicing)

Best, Andrea

On 5 Feb 2019, at 16.20, Aleksey Zimin notifications@github.com wrote:

Too high coverage may cause quorum to crash. I would not trust corrected reads/super-reads when using >100x coverage: the number of super-reads will be too large, and there will be many super-reads that differ by a single base where that base is a sequencing error, due to error biases in Illumina reads. When you have too much coverage, multiple reads will have the same error at the same base, thereby confirming each other. This is the biggest problem when using too much coverage.

On Tue, Feb 5, 2019 at 10:01 AM amarquard notifications@github.com wrote:

Hi Aleksey, Thanks for taking the time. I am actually only using the superreads module of masurca, to use for alignment with HISAT2. Do you think this is still why it’s crashing then, or just out of memory? I was unsure how much memory to allocate when just using masurca-superreads. Do the same guidelines apply as for the masurca assembler? Best, Andrea

On 5 Feb 2019, at 15.57, Aleksey Zimin notifications@github.com wrote:

Yes, Illumina sequencing deeper than 100x for a haploid genome and over 150x for a diploid genome is detrimental to assembly quality. Unfortunately, there is no way to use the extra data in a way that will benefit the assembly, as it will introduce errors and spurious duplications into the sequence.

You should just use first ~100x for haploid or inbred genomes and about ~150x for heterozygous genomes.
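A rough way to turn the ~100x / ~150x guideline into a number of read pairs, as a sketch only: the function names and the example genome size are illustrative assumptions, not part of MaSuRCA. It assumes uncompressed FASTQ with exactly four lines per record and the same record order in both mate files, so taking the same number of records from each file keeps the pairs in sync.

```python
from itertools import islice

def pairs_for_coverage(genome_size_bp, read_length_bp, target_coverage):
    """Number of read pairs whose combined bases give about target_coverage.

    coverage = (pairs * 2 * read_length) / genome_size
    """
    bases_needed = genome_size_bp * target_coverage
    return bases_needed // (2 * read_length_bp)

def take_first_records(in_path, out_path, n_records):
    """Copy the first n_records FASTQ records (four lines each) to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(islice(src, 4 * n_records))

# Example with made-up numbers: a 300 Mbp genome, 150 bp paired-end reads, 100x.
# n = pairs_for_coverage(300_000_000, 150, 100)
# take_first_records("frag_1.fastq", "frag_1.100x.fastq", n)
# take_first_records("frag_2.fastq", "frag_2.100x.fastq", n)
```

The genome size has to be estimated beforehand; if it is off, the resulting coverage is off by the same factor.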

On Tue, Feb 5, 2019 at 8:30 AM amarquard notifications@github.com wrote:

Hi Aleksey, I have a similar problem. Do you mean that we should use the first 100x reads and discard the rest - is there no way of using all of my reads, now that I've spent money on sequencing it deeper? I think I have somewhere <700x.


amarquard commented 5 years ago

One last question. Does SuperReads_RNA (like masurca) take care of adaptor removal?


amarquard commented 5 years ago

Also, I can’t figure out whether mean and stdev in the config file should be READ length or INSERT length, and whether they are important when just creating superreads (not de novo assembly). (My reads are 150bp paired end.) Hope you have a second to help me (again). Thanks in advance!


alekseyzimin commented 5 years ago

Yes, in the same way MaSuRCA does.


lly1214 commented 2 years ago

How much coverage do you have? If you have more than 100x Illumina coverage, just use the first 100x and discard the rest. Too much coverage is the main reason for such failures.

Dear alekseyzimin: I don't have more than 100x coverage. Are there any other causes of this problem? Can it be solved by adjusting parameters? I am using unmapped reads from multiple individuals for this assembly, and my paired-end read files are several hundred gigabytes in size. Could that be related?
Thank you very much!
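
For reference, the quoted advice (keep only the first 100x of Illumina coverage) can be sketched roughly as below. This is a minimal illustration and not part of MaSuRCA: the function names, the 300 Mbp genome size, and the 150 bp read length are hypothetical placeholders, and it assumes uncompressed FASTQ files with exactly four lines per record. Both mate files would need to be truncated to the same number of records so that the pairs stay in sync.

```python
import math
from itertools import islice


def pairs_for_coverage(genome_size_bp, read_length_bp, target_coverage=100):
    """Return the number of read pairs needed to reach target_coverage.

    Each pair contributes 2 * read_length_bp bases.
    """
    bases_needed = genome_size_bp * target_coverage
    return math.ceil(bases_needed / (2 * read_length_bp))


def head_fastq(src_path, dst_path, num_records):
    """Copy the first num_records FASTQ records from src_path to dst_path.

    Assumes every record occupies exactly four lines (no wrapped sequences).
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(islice(src, 4 * num_records))


if __name__ == "__main__":
    # Hypothetical values: replace with your own genome size estimate
    # and the actual read length of your library.
    n_pairs = pairs_for_coverage(300_000_000, 150, target_coverage=100)
    print(n_pairs)
    # Apply the same record count to both mates (paths are placeholders):
    # head_fastq("frag_1.fastq", "frag_1.100x.fastq", n_pairs)
    # head_fastq("frag_2.fastq", "frag_2.100x.fastq", n_pairs)
```

The truncated files would then replace the originals on the `PE=` line of the configuration file. Whether this helps when coverage is already below 100x is not established in this thread.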

lly1214 commented 2 years ago

Dear Niwradel: have you solved this problem, and if so, how did you finally solve it? I have run into the same problem.

Thank you very much.

alekseyzimin commented 2 years ago

Can you restate what your problem is?

On Mon, Jan 17, 2022 at 2:43 AM lly1214 @.***> wrote:

dear Niwradel : have you solved this problem? and how did you finally solve it ? I met the same problem.

thank you very much.


-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com