mangsta commented 8 years ago

Hi, I recently ran MITObim on some data. It used the entire 3 day limit allowed on our system but did not finish. The farthest it got was to say "running mapping assembly using MIRA". Any ideas why? I used SBATCH to run MITOBIM.

SBATCH:

!/bin/bash

SBATCH --time=3-00:00:00 # walltime

SBATCH --ntasks=24 # number of processor cores (i.e. tasks)

SBATCH --mem=256GB # total memory

~/scripts/MITObim_1.8.pl -start 1 -end 200 -sample amaranth -ref HaloxylonCp -readpool filtered.fastq --redirect_tmp /tmp -maf initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_assembly/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_d_results/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_out.maf &> log

OUTPUT:

Full command run: /fslhome/mangsta/scripts/MITObim_1.8.pl -start 1 -end 200 -sample amaranth -ref HaloxylonCp -readpool filtered.fastq --redirect_tmp /tmp -maf initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_assembly/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_d_results/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_out.maf

All paramters seem to make sense: startiteration: 1 enditeration: 200 sample: amaranth refname: HaloxylonCp readpool: /panfs/pan.fsl.byu.edu/scr/usr/31/mangsta/filtered.fastq maf: /panfs/pan.fsl.byu.edu/scr/usr/31/mangsta/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_assembly/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_d_results/initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast_out.maf quick: 0 paired: 0 (off=0, on=1) assembly mode: 0 (mapping=0, denovo=1) verbose: 0 (off=0, on=1) split: 0 (off=0, on=1) minimum avg. coverage: 0 (off=0) clean: 0 (off=0, on=1) trim reads: 0 (off=0, on=1) trim overhang: 0 (no=0, yes=1) platform: SOLEXA kmer baiting: 31 number of missmatches in mapping assembly: default (15% of average read length loaded) proofreading: off

Starting MITObim

ITERATION 1

Jan 10 15:00:55

recover backbone by running miraconvert on maf file

reference is too long for mirabait to be handled in one go -> will be split into sub-sequences

fishing readpool using mirabait (k = 31)

running mapping assembly using MIRA

chrishah commented 8 years ago

Hi,

Organellar read coverage from Illumina data is sometimes really high, which slows down the assembly considerably. Hard to say if this is the case here without more info about your data, but this is probably where I would start looking. I would try to downsample your data to ~100x coverage. I am assuming you have successfully completed an initial mapping assembly with MIRA and that this also took a while? You should be able to calculate the appropriate size of a read subsample that you would need for ~100x coverage based on the average coverage that you got for the initial mapping assembly based on the complete set of reads. Find the average coverage you got from the initial assembly by looking into the file *_info_contigstats.txt in the *_d_info directory. So, if you had 500x coverage I would go for a random subset of ~1/5 of your data.

Hope that helps! cheers, Christoph

mangsta commented 8 years ago

Thanks, I will try that and get back to you! Ryan

On Wednesday, January 13, 2016 12:51 PM, Christoph Hahn <notifications@github.com> wrote:

Hi,Organellar read coverage from Illumina data is sometimes really high, which slows down the assembly considerably. Hard to say if this is the case here without more info about your data, but this is probably where I would start looking. I would try to downsample your data to ~100x coverage. I am assuming you have successfully completed an initial mapping assembly with MIRA and that this also took a while? You should be able to calculate the appropriate size of a read subsample that you would need for ~100x coverage based on the average coverage that you got for the initial mapping assembly based on the complete set of reads. Find the average coverage you got from the initial assembly by looking into the file _info_contigstats.txt in the _d_info directory. So, if you had 500x coverage I would go for a random subset of ~1/5 of your data.Hope that helps! cheers, Christoph— Reply to this email directly or view it on GitHub.

mangsta commented 8 years ago

Christoph, I looked at the contigstats.txt, and it showed that my coverage was below 500X.

name length av.qual #-reads mx.cov. av.cov GC% CnIUPAC CnFunny CnN CnX CnGap CnNoCov

NC_027668.1_bb 153686 74 1132 118 26.99 36.44 1979 0 0 0 21096 0 ~

Any ideas where I should go from here? Thanks,Ryan

On Wednesday, January 13, 2016 12:51 PM, Christoph Hahn <notifications@github.com> wrote:

Hi,Organellar read coverage from Illumina data is sometimes really high, which slows down the assembly considerably. Hard to say if this is the case here without more info about your data, but this is probably where I would start looking. I would try to downsample your data to ~100x coverage. I am assuming you have successfully completed an initial mapping assembly with MIRA and that this also took a while? You should be able to calculate the appropriate size of a read subsample that you would need for ~100x coverage based on the average coverage that you got for the initial mapping assembly based on the complete set of reads. Find the average coverage you got from the initial assembly by looking into the file _info_contigstats.txt in the _d_info directory. So, if you had 500x coverage I would go for a random subset of ~1/5 of your data.Hope that helps! cheers, Christoph— Reply to this email directly or view it on GitHub.

chrishah commented 8 years ago

Hi Ryan, Right, so coverage seems not to be an issue. I looked at the command you reported in your initial post and I was wondering: Are you trying to use MITObim with Pacbio data?

If yes, I think this may account for the long runtime especially if you're processing uncorrected PB data because this data is very noisy. MITObim is currently set up to handle Illumina data and the MIRA assembly in the pipeline is tuned for the particular properties of Illumina data. MIRA can handle corrected PacBio data (uncorrected data is discouraged), but the MITObim script will need some modifications. I can make some suggestions if you want.

Did you tell MIRA you were handling Pacbio data in your manifest file for the initial mapping assembly? How long did this assembly take?

cheers, Christoph

mangsta commented 8 years ago

Hi Christoph, Thanks, I am open to any and all suggestions. Yes I am using Pacbio data. I am not sure what corrected means but I have split the reads into smaller sections so that they are shorter than the 29,990 maximum read length. I can't remember for sure how long the original assembly took but I think that it was at least 3 hours. I have attached below my original manifest.conf file.

manifest file for basic mapping assembly with illumina data using MIRA 4

project = initial-mapping-AmaranthCCSPacBio-to-HaloChloroplast job=genome,mapping,accurate

parameters = -DI:trt=/tmp -GE:not=20 -NW:mrnl=0:cac=warn PCBIOHQ_SETTINGS -CL:pec=yes -CO:mrpg=5

readgroup is_reference data = reference.fa strain = HaloxylonCp

readgroup = PacBio data = filtered.fastq technology = pcbiohq strain = amaranth

Thanks,Ryan

On Thursday, January 21, 2016 9:08 AM, Christoph Hahn <notifications@github.com> wrote:

Hi Ryan, Right, so coverage seems not to be an issue. I looked at the command you reported in your initial post and I was wondering: Are you trying to use MITObim with Pacbio data? If yes, I think this may account for the long runtime especially if you're processing uncorrected PB data because this data is very noisy. MITObim is currently set up to handle Illumina data and the MIRA assembly in the pipeline is tuned for the particular properties of Illumina data. MIRA can handle corrected PacBio data (uncorrected data is discouraged), but the MITObim script will need some modifications. I can make some suggestions if you want.Did you tell MIRA you were handling Pacbio data in your manifest file for the initial mapping assembly? How long did this assembly take?cheers, Christoph— Reply to this email directly or view it on GitHub.

mangsta commented 8 years ago

Any additional suggestions?

mangsta commented 8 years ago

How would you define corrected Pacbio data?

mangsta commented 8 years ago

What MITObim scripts modifications do you suggest? I am using CCS reads that have been modified to be less than 29,000 bps.

darencard commented 8 years ago

PacBio data has a very high (relatively speaking) error rate, which makes it difficult to work with when mapping and assembling, but is obviously useful for things like scaffolding in its raw, uncorrected state. Corrected PacBio data is data where the errors have been corrected using sequencing from a less error-prone platform like Illumina. One can essentially map high (>10-20x) Illumina reads onto their PacBio reads and have confidence to infer 'error free' PacBio data, though there is probably a bit more to it than that in pipelines that do it (but I'm sure you get the idea). Of if you have sufficiently high PacBio sequence coverage, you can use the fact that errors are random to self-correct.

chrishah commented 8 years ago

Hello,

Sorry for the long delay and thanks to @darencard for the helpful comment. I've modified the latest MITObim version - it can handle Pacbio data now. Simply specify --platform pacbio in your command. It's not tested thoroughly so use at own risk. I'd appreciate if you'd report back on any particular success/problems you encountered. Thanks!

A few words of advice: As @darencard pointed out the error rate of PacBIo raw data is quite high, which makes it difficult to handle for assemblers, unless it's corrected/filtered properly. MIRA is currently only set up to handle 'high quality PacBio data' with a reasonable error rate. Such data you can get by correcting your long Pacbio reads with Illumina data (here is Biostar discussion on potential tools to use for correction). However, a subset of your Pacbio data is already of very high quality. Pacbio produces so-called CCS (circular consensus) sequences, i.e. consensus sequences that are produced when sequencing the same molecule several time. Sequences obtained from relatively small DNA fragments (a few kb) will in many cases be supported by several passes of sequencing and will thus be of high quality. These aren't going to be the longest reads in your PB read length distribution, but they are massive compared to your default Illumina sequence, so well worth using for Organelle assembly. If you want to bypass the tedious correction process using Illumina data for a first test with MITObim I would suggest to limit yourself to only high quality CCS reads for now, i.e. CCS that have gone through >5 passes of sequencing. This should bring the error rate down to almost Illumina levels. There are several ways and means by which you can get your hands on such data: (1) run the raw data through SMRT-analysis and specify a minimum of 5 passes. (2) use command line tool like pbh5tools to extract from raw data. (3) ask your sequencing provider.

Hope that helps! Good luck!

cheers, Christoph

chrishah / MITObim

Run not finishing #15

!/bin/bash

SBATCH --time=3-00:00:00 # walltime

SBATCH --ntasks=24 # number of processor cores (i.e. tasks)

SBATCH --mem=256GB # total memory

ITERATION 1

name length av.qual #-reads mx.cov. av.cov GC% CnIUPAC CnFunny CnN CnX CnGap CnNoCov

manifest file for basic mapping assembly with illumina data using MIRA 4