How perform hybrid assembly with illumina and sanger reads?

seoanezonjic commented 3 years ago

Hi @bachev I want to get the sequence of a plasmid in a project that has paired end illumina reads and sanger reads (ab1 files that I convert to scf files as recommended by the manual). The problem is that I don't config correcly the manifest file to use the scf files. Can you give me a manifest example of paired illumina reads combined with scf files for hybrid assembly?

Thank you in advance

Pedro Seoane

bachev commented 3 years ago

An oversight of mine while putting together the MIRA 5 docs a couple of years ago: I removed the ability to analyse SCF files. Sorry for that.

Transform ab1 files to fasta+qual or fastq (don't forget vector clipping). Specify the naming scheme of forward/reverse in the manifest (if needed rename the reads to follow a wished naming scheme). The rest of the manifest should be more or less standard.

seoanezonjic commented 3 years ago

Hi @bachev In this manual, the scf files are still mentioned: http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.pdf Also, the Manifest.C code has functions checking for scf files and the output of MIRA saids sonething about that scf info are not found for fastq files when scanned read group is sanger. These mentions lead me to think that the scf format is used by MIRA.

I've tried your suggestions an here my sanger manifest: readgroup = sanger_reads data = sanger_data/sanger_F.fastq sanger_data/sanger_R.fastq segment_placement = ---> <--- technology = sanger

MIRA gives me a paired read error, so I suspect that my manifest is incorrect. I've read about segment_naming and I think that i must generate only one fastq, but I don't know how to specify the reverse and forward reads: The current names of the reads are: 14M_SH12VF_M13F_primer (forward) 11M_SH12IR_M13R_primer (reverse) I'm afraid that this naming is no supported by MIRA. What naming scheme could use in this case? I highly appreciate If you are able to suggest me the correct sanger readgroup manifest in this case. Thank you in advance Pedro

bachev commented 3 years ago

You will want to read the section about segment naming: http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_ref_manifest_readgroups_segname

You will need to rename your reads. The easiest schemes will probably be TIGR, Forward/Reverse or Solexa.

E.g., should you want to go with TIGR, make sure to rename the above reads to

14M_SH12VF_M13TF
11M_SH12IR_M13TR

(you can do this easily and automatically with sed)

Then add the line segment_naming=tigr to the manifest.

seoanezonjic commented 3 years ago

I've followed your indications and the assembly is performed but it ends with the following error when I use only sanger reads:

Extraction failed, please consult MyFirstAssembly_assembly/MyFirstAssembly_d_results/ec.log
for more information on why.
Large contigs could not be extracted, sorry.

DON'T PANIC! In the directory
    'MyFirstAssembly_assembly/MyFirstAssembly_d_results'
you will find a script called   'extractLargeContigs.sh'
which shows you how to extract large contigs from your assembly. Eitherread it to understand how it's done or ... just run it :-)

I've checked MyFirstAssembly_assembly/MyFirstAssembly_d_results/ec.log and the output is the following:

Parsing special MIRA parameters: --job=genome,denovo,accurate,Sanger -NW:cmrnl=no
Ok.
Loading from maf, saving to: caf fasta maf wig fastaqual

Fatal error (may be due to problems of the input data or parameters):

********************************************************************************
* MAF file MyFirstAssembly_out.maf not found for loading.                      *
********************************************************************************
->Thrown: void MAFParse::registerFile(const std::string & fileName)

Also, I've executed the sh mentioned in the first error, and the output is as follows:

This is how MIRA extracted large contigs from the total assembly for you:
it made a list of contigs (see info file ../MyFirstAssembly_d_info/MyFirstAssembly_info_largecontigs.txt)
which reached a certain length (usually 500, see -MI:lcs=...) and had at least 1/3 of
the average coverage (per sequencing technology).

Note that you can redefine what large contigs are for you by simply using any
combination of -n, -x, -y and -z parameters of miraconvert instead of only the
parameter as used in this example.
You can follow the progress of the conversion in the file ec.log

./extractLargeContigs.sh: line 16: 17610 Trace/breakpoint trap   (core dumped) miraconvert -t caf -t maf -t wig -t fasta -n ../MyFirstAssembly_d_info/MyFirstAssembly_info_largecontigs.txt -P "--job=genome,denovo,accurate,Sanger -NW:cmrnl=no" MyFirstAssembly_out.maf MyFirstAssembly_LargeContigs_out > ec.log 2>&1
Parsing special MIRA parameters: --job=genome,denovo,accurate,Sanger -NW:cmrnl=no
Ok.
Loading from maf, saving to: caf fasta maf wig fastaqual

Fatal error (may be due to problems of the input data or parameters):

********************************************************************************
* MAF file MyFirstAssembly_out.maf not found for loading.                      *
********************************************************************************
->Thrown: void MAFParse::registerFile(const std::string & fileName)

Ooops, something went wrong. Please consult the file ec-log in this directory.

What is the reason of this error using only sanger reads? When I add the Illumina read group, this error is not shown.

Thank you in advance Pedro Seoane

bachev commented 3 years ago

It could be the Sanger reads are of comparatively low coverage, and therefore MIRA built only small contigs which are considered "junk" in the context of genome assemblies.

Illumina brings in the coverage, helping the combined assembly. Look through the assembly log to find out about average coverages present per sequencing technology.

bachev / mira

How perform hybrid assembly with illumina and sanger reads? #7