JetBrains-Research / span

SPAN Semi-supervised Peak Analyzer
https://doi.org/10.1093/bioinformatics/btab376
MIT License

Multi-sample labelling #28

Closed jessakay closed 4 years ago

jessakay commented 4 years ago

Thank you for sharing this awesome tool! I've had some success in following your very thorough tutorial, but I was concerned about the example being restricted to a single cell type (monocytes from different donors?).

If I would like to have a single model for several cell types with very different epigenomic landscapes (and thus limited overlap in their enriched regions), am I correct in assuming that I should try my best to just label patterns conserved across all samples?

Additionally, what are your suggestions if I would like to perform peak calling in which the regime of a mark's distribution drastically changes (e.g., broadly distributed in one condition, peaky in another, and possibly also mixed)? Would it still be possible to have one model in this case?

olegs commented 4 years ago

Dear jessakay, thanks for the warm words about SPAN and for the question.

> If I would like to have a single model for several cell types with very different epigenomic landscapes (and thus limited overlap in their enriched regions), am I correct in assuming that I should try my best to just label patterns conserved across all samples?

Indeed, in the tutorial we used a single cell line, monocytes from different donors, so that one set of labels could be created that fits all the tracks together. The main idea of SPAN's semi-supervised approach is that it builds a separate statistical model for each track and uses the labels to find the best meta-parameters for each track independently. So in the tutorial we focused on a single cell line only to make the labelling procedure easier.
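
To illustrate the idea of per-track tuning against shared labels, here is a minimal Python sketch. It is not SPAN's implementation: the toy caller `call_peaks`, the label kinds `"peaks"` and `"noPeaks"`, and the tuned parameters `threshold` and `gap` are all hypothetical stand-ins chosen for the example. It only shows how the same labels can lead to different parameter choices for tracks with different signal profiles.

```python
from itertools import product


def call_peaks(scores, bin_size, threshold, gap):
    """Toy caller (hypothetical): bins scoring >= threshold become peaks,
    and peaks separated by at most `gap` bins are merged."""
    peaks = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        start, end = i * bin_size, (i + 1) * bin_size
        if peaks and start - peaks[-1][1] <= gap * bin_size:
            peaks[-1] = (peaks[-1][0], end)  # extend the previous peak
        else:
            peaks.append((start, end))
    return peaks


def label_errors(peaks, labels):
    """Count labels that the peak set violates.
    labels: (start, end, kind) with kind in {"peaks", "noPeaks"} (assumed kinds)."""
    errors = 0
    for start, end, kind in labels:
        has_peak = any(p_start < end and start < p_end for p_start, p_end in peaks)
        if (kind == "peaks") != has_peak:
            errors += 1
    return errors


def tune_track(scores, labels, bin_size=100, thresholds=(1, 2, 3), gaps=(0, 1, 2)):
    """Grid-search the parameters for ONE track; returns (errors, threshold, gap)."""
    best = None
    for threshold, gap in product(thresholds, gaps):
        errors = label_errors(call_peaks(scores, bin_size, threshold, gap), labels)
        if best is None or errors < best[0]:
            best = (errors, threshold, gap)
    return best


# The same labels are applied to every track, but each track is tuned separately.
labels = [(200, 300, "peaks"), (700, 1000, "noPeaks")]
tracks = {
    "noisy_track": [1, 1, 5, 1, 1, 0, 0, 1, 0, 0],
    "weak_track": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
}
best_params = {name: tune_track(scores, labels) for name, scores in tracks.items()}
```

In this toy setup the noisy track ends up with a stricter threshold than the weak one, even though both were tuned against the same two labels.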

> Additionally, what are your suggestions if I would like to perform peak calling in which the regime of a mark's distribution drastically changes (e.g., broadly distributed in one condition, peaky in another, and possibly also mixed)? Would it still be possible to have one model in this case?

In theory this should work for you, since each track relies on its own statistical model.

By the way, you can easily check the main characteristics of the resulting peak calls directly in the JBR Genome Browser: select all the tracks, open the context menu, and click the “About Tracks” menu item. It shows the number of peaks, their total length, and the minimal, maximal, average, and median peak size for all the samples. Another handy option is to check how your results overlap by clicking “Overlap info” in the context menu. It shows a heat map of overlaps, i.e. the fraction of peaks overlapping those in the other tracks.
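
For anyone who wants the same kind of numbers outside the browser, a rough Python sketch follows. It assumes peaks are available as `(start, end)` pairs on a single chromosome, and it takes "overlap" to mean sharing at least one base. The exact definitions used by JBR are not stated in this thread, so treat the function names and the overlap rule as assumptions.

```python
from statistics import mean, median


def peak_summary(peaks):
    """Count, total length, and min/max/mean/median length of (start, end) peaks."""
    lengths = [end - start for start, end in peaks]
    if not lengths:
        return {"count": 0, "total_length": 0}
    return {
        "count": len(lengths),
        "total_length": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


def overlap_fraction(a, b):
    """Fraction of peaks in `a` that share at least one base with a peak in `b`."""
    if not a:
        return 0.0
    hits = sum(
        1 for start, end in a
        if any(start < b_end and b_start < end for b_start, b_end in b)
    )
    return hits / len(a)


def overlap_matrix(tracks):
    """Pairwise overlap fractions, i.e. the values a heat map would display."""
    return {
        x: {y: overlap_fraction(tracks[x], tracks[y]) for y in tracks}
        for x in tracks
    }


# Made-up example peaks for two samples.
tracks = {
    "sample_a": [(0, 100), (200, 500)],
    "sample_b": [(50, 150)],
}
summary_a = peak_summary(tracks["sample_a"])
matrix = overlap_matrix(tracks)
```

Note that the matrix is not symmetric: half of the peaks in `sample_a` overlap `sample_b`, while the single peak in `sample_b` overlaps `sample_a`.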

jessakay commented 4 years ago

Thank you for the explanations! They were really helpful.