JaneliaSciComp / msg

Multiplexed Shotgun Genotyping
http://genomics.princeton.edu/AndolfattoLab/MSG.html

Solve CSV field error #45

Open fdchevalier opened 2 years ago

fdchevalier commented 2 years ago

Hi MSG team,

First, thank you for developing/maintaining this pipeline.

We have generated F2 hybrids from a cross between two species. Unfortunately, our data are sparse, so I am using this pipeline to rebuild genotypes. The genome of these species is ~380 Mb in size with 8 chromosomes (80 Mb max). In the current assembly, most scaffolds are assembled into complete chromosomes, with some additional tiny scaffolds left over.

When running the pipeline, I got an error when extracting the reference alleles after filtering for common reads. The error was _csv.Error: field larger than field limit (131072), which Python's csv module raises when a single field exceeds its default limit of 131072 characters (128 KB of text). I do not know the exact origin of this error, but it could be due to a high number of variants on a long chromosome. The proposed solution sets the limit to the maximum possible value.
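For readers who hit the same error elsewhere, here is a minimal sketch of that kind of fix. It is not the exact patch in this pull request: the helper name `raise_csv_field_limit` is hypothetical, and the fallback loop is an assumption added for platforms where `sys.maxsize` does not fit in a C long (in which case `csv.field_size_limit` raises `OverflowError`).

```python
import csv
import sys


def raise_csv_field_limit():
    """Raise the csv module's per-field size limit as high as the platform allows.

    Hypothetical helper for illustration; returns the limit that was accepted.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long on this platform
            # (typically 32-bit longs), so retry with a smaller value.
            limit //= 10


# Call once, before any csv.reader is asked to parse very long fields.
accepted = raise_csv_field_limit()

# A 200,000-character field now parses instead of raising _csv.Error.
row = next(csv.reader(["name," + "x" * 200000]))
print(len(row[1]))
```

The limit is process-wide, so it only needs to be set once near the top of the script that reads the file.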

Let me know if you need more information.

dstern commented 2 years ago

This script is run only if you set pepthresh in msg.cfg. We have rarely used this option and therefore have not entirely debugged it. It is not necessary for the pipeline. I recommend commenting out pepthresh and seeing if the pipeline completes. Also, if you have many small scaffolds, it is probably not useful to run all of them. Running all scaffolds imposes a huge I/O burden, and you will find the pipeline runs MUCH faster if you ignore small scaffolds. In msg.cfg, change chroms=all to a list of the chromosomes you want to analyze. For example, in Drosophila, chroms = 2L,2R,3L,3R,4,X

fdchevalier commented 2 years ago

Thank you @dstern for the quick answer.

I do indeed use pepthresh because it is in the toy example. I don't know what it does, and it is not documented in the readme. So if it is not critical/useful, it might be good to comment it out in the example.

Regarding the small scaffolds, I should indeed probably remove (some of) them. We still have some of decent size (N90 is 25 Mb). So what would be your recommendation based on size?

Thank you.

dstern commented 2 years ago

Don't remove the sequences from your genome files. You want the full sequence available for read mapping. Just specify the chromosomes to analyze. Which sizes to include depends on your genome and question. You will probably just want to do some trial and error.
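One way to make that trial and error quicker is to generate candidate chroms lines from the reference FASTA at different size cutoffs, leaving the genome files themselves untouched. The sketch below is only an illustration: the function names and the cutoff are hypothetical, and it assumes the names in chroms must match the first word of each FASTA header, which this thread does not confirm. The comma-separated format follows the Drosophila example above.

```python
def scaffold_lengths(fasta_lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name (assumption).
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def chroms_setting(lengths, min_length):
    """Build a 'chroms = a,b,c' line listing sequences of at least min_length bases."""
    keep = [name for name, size in lengths.items() if size >= min_length]
    return "chroms = " + ",".join(keep)


# Tiny in-memory example; for a real genome, pass an open file handle instead.
example = [">chr1 demo", "ACGTACGT", "ACGT", ">scaffold_9", "ACG"]
print(chroms_setting(scaffold_lengths(example), min_length=10))
# prints: chroms = chr1
```

Because only the chroms line changes, reads are still mapped against the full reference, as recommended above.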

fdchevalier commented 2 years ago

Sorry for the bad wording. My plan was to remove the scaffolds from the analysis by following your advice, not from the genome. I will play with the pipeline settings then. Thank you.