berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License

Genome install extra step / repetitiveness plot figure? #59

Open darrenabbey opened 7 years ago

darrenabbey commented 7 years ago

Genome repetitiveness calculation during genome install is unnecessary. The calculation was for investigational purposes early on in the process of writing YMAP. It was subsequently found to not be a useful metric for analysis.

Removing this calculation will save significant time when installing a new genome.

darrenabbey commented 7 years ago

Commenting out or removing lines [19-20; 114-127; 210-225; 247-261; 285-300] from the script "Ymap_root/scripts_genomes/genome.install_6.sh" should effect this change.
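For anyone who would rather script the edit than do it by hand, here is a rough sketch in Python of commenting out those line ranges. The helper name `comment_out_ranges` and the `# ` prefix are assumptions made for illustration, not anything in Ymap. The line numbers are the ones listed above and only apply to the version of the script they were written against. Commenting line by line can also break shell constructs that span a range boundary (for example an `if`/`fi` pair), so the result should be reviewed before use.

```python
# Hypothetical helper: comment out 1-indexed, inclusive line ranges.
# The ranges below are the ones quoted in this thread; they are only valid
# for the script version they were written against.
RANGES = [(19, 20), (114, 127), (210, 225), (247, 261), (285, 300)]


def comment_out_ranges(lines, ranges, prefix="# "):
    """Return a copy of `lines` with the given line ranges commented out.

    Blank lines and lines that already start with '#' are left unchanged.
    """
    targets = set()
    for start, end in ranges:
        targets.update(range(start, end + 1))

    result = []
    for number, line in enumerate(lines, start=1):
        stripped = line.lstrip()
        if number in targets and stripped and not stripped.startswith("#"):
            result.append(prefix + line)
        else:
            result.append(line)
    return result


# Example use (writes to a copy rather than editing the script in place):
#
#   with open("Ymap_root/scripts_genomes/genome.install_6.sh") as handle:
#       original = handle.read().splitlines(keepends=True)
#   with open("genome.install_6.edited.sh", "w") as handle:
#       handle.writelines(comment_out_ranges(original, RANGES))
```

Writing to a separate file keeps the original available for a diff before anything is replaced.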

vladimirg commented 7 years ago

@darrenabbey, do you think it can be useful? For example, Phytophthora infestans is quite repetitive, and I wonder if we can do something with those regions (maybe ignore them in analysis?).

darrenabbey commented 7 years ago

There is the potential for it to be useful. In some genomes repetitiveness analysis can help reveal where centromeres are, for example.

In the C. albicans genome, repetitiveness correlates strongly with GC bias. Not an exact correlation, but pretty strong.

That said, the analysis isn't used in the construction of any figure types at this time. It might be worthwhile to discuss adding such a figure type.

darrenabbey commented 7 years ago

The reason I included repetitiveness analysis was that it appeared to correspond to a prominent CNV noise signal in a lot of C. albicans datasets. That noise was better matched by GC bias, which has a better rationale behind it as well. Thus, the option of GC-correction was set up, but not repetitiveness-correction.

darrenabbey commented 7 years ago

The whole-chromosome repetitiveness plots I found most useful included a simple trace of the analysis, smoothed to limit the visual noise. When viewing much smaller regions, less smoothing was needed. For consistency with everything else, chromosome cartoon outlines would be used.

I remember plotting the smoothed trace such that the median height value was placed near the bottom quarter of the figure, with the y-range being sufficient to capture the max heights in the trace.

Providing units to the y-axis would be problematic. Perhaps it could be described as a "repetitiveness index" to avoid needing a specific unit.
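As a rough illustration of that layout, here is a minimal Python sketch, not the Ymap plotting code. It assumes the repetitiveness values are already available as one number per bin along a chromosome. The function names, the moving-average smoothing, and the `median_fraction` parameter are all assumptions made for this example; the original smoothing method is not stated in this thread.

```python
import numpy as np


def smooth_trace(values, window):
    """Smooth a per-bin trace with a centered moving average.

    `window` is a hypothetical smoothing width in bins. Edges are implicitly
    zero-padded, so the first and last few bins are pulled toward zero.
    """
    values = np.asarray(values, dtype=float)
    window = min(int(window), values.size)
    if window <= 1:
        return values.copy()
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="same")


def y_limits(trace, median_fraction=0.25):
    """Pick y-limits so the median sits `median_fraction` of the way up the
    axis and the maximum of the trace sits at the top."""
    trace = np.asarray(trace, dtype=float)
    top = trace.max()
    median = np.median(trace)
    if top == median:
        # Flat trace: fall back to an arbitrary range around the value.
        return median - 1.0, median + 1.0
    # Solve (median - bottom) / (top - bottom) = median_fraction for bottom.
    bottom = (median - median_fraction * top) / (1.0 - median_fraction)
    return bottom, top
```

With Matplotlib, the result could be applied with `ax.set_ylim(bottom, top)` and `ax.set_ylabel("repetitiveness index")`. A wider `window` would suit whole-chromosome views and a narrower one would suit smaller regions.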

darrenabbey commented 7 years ago

I think it was in Cryptococcus where I localized the centromeres by examining repetitiveness traces because I was having a hard time finding coordinates for them in the databases.

darrenabbey commented 7 years ago

I do have a much faster version of code for doing the repetitiveness processing, using a bit of a different algorithmic approach. I've been working on this offline with respect to Ymap. The final output files are bit-identical to what is in Ymap right now, so it shouldn't take too long to integrate it.