daler / metaseq

Framework for integrated analysis and plotting of ChIP/RIP/RNA/*-seq data
https://daler.github.io/metaseq
MIT License

MetaSeq on non-traditional files that have already been binned? #20

Closed ghost closed 8 years ago

ghost commented 8 years ago

I have written a new, fast, medium/diffuse-area ChIP-seq peak caller in Python. It uses a binning strategy, and one of the files it produces contains the island counts:

Chromosome Bin Chip1 Chip2 Input1 Input2
chr1 11550 3 6 1 2
chr1 11600 4 5 0 2
...

I'd love to make it easy to use metaseq with my ChIP-seq caller to produce various graphs. With the data above I could regenerate four bed files* so that using metaseq would be straightforward, but that would be needlessly time-consuming. So instead I'll look into creating an adapter for data like the above.

What I want is a file adapter that iterates over the columns of the dataframe and returns one signal object for each.
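A rough sketch of what I have in mind (pandas-based; the table contents and the function name iter_signals are just hypothetical examples, not anything from metaseq):

```python
import io

import pandas as pd

# Hypothetical example of the binned output format shown above.
TABLE = """\
Chromosome Bin Chip1 Chip2 Input1 Input2
chr1 11550 3 6 1 2
chr1 11600 4 5 0 2
"""

def iter_signals(path_or_buffer):
    """Yield (sample_name, series) pairs, one per data column.

    Each series is indexed by (Chromosome, Bin), which a
    genomic_signal-like wrapper could later query by interval.
    """
    df = pd.read_csv(path_or_buffer, sep=r"\s+")
    df = df.set_index(["Chromosome", "Bin"])
    for column in df.columns:
        yield column, df[column]

for name, signal in iter_signals(io.StringIO(TABLE)):
    print(name, signal.values.tolist())
```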

What do genomic_signal objects look like? I imagine they are pretty similar to my table above... I could not find them in the test suite: https://github.com/daler/metaseq/blob/master/metaseq/test/test.py

I guess I could play around with metaseq for half a day and find out, but I am not used to reading OO code, so it would be hard for me. I'd love some comments on the strategy for / feasibility of creating genomic_signal objects for data like the above, if you have the time.

*For sample 1 I'd make a bed with the following:

chr1 11550 11600
chr1 11550 11600
chr1 11550 11600
chr1 11550 11600
chr1 11600 11650
...
ghost commented 8 years ago

It seems like there are two readers that it might be possible to use:

1) the deseq reader or 2) the dataframe reader.

Can any of these be used as is? If not, what do I need to change about my data to use them?

I was mistaken about the possibility of creating bed files: my binned files do not contain strand info. So using one of the two readers above, or creating my own reader, seems like the most straightforward option.

Better example of my data than the one shown above:

Chromosome  Bin ../data/bed/Exp2_0_Polymerase_II.bed.gz ../data/bed/Exp1_0_Polymerase_II.bed.gz ../data/bed/Exp2_0h_Input.bed.gz    ../data/bed/Exp1_0h_Input.bed.gz
chr1    10000   1   3   4   3
chr1    10050   0   2   1   2
chr1    10100   1   4   4   2
chr1    10150   3   2   4   1
chr1    10200   5   2   2   3
chr1    10250   3   0   1   2
chr1    10300   3   1   2   1
chr1    10350   1   4   1   4
chr1    10450   2   4   2   2

It would not surprise me if this is very close to one of the formats metaseq already knows how to parse.

ghost commented 8 years ago

Come to think of it, metaseq would at least need some gene info for producing the heatmaps.

I'm guessing metaseq needs one array per TSS per input file. So, for example, for Sample1 you'd have number_of_tsses arrays, each tss_region_length / bin_length elements long.

If I create a dict like:

{"tss1": [0, 2, 1, 8...], "tss2": [42, 50, 68, 74...], "tss3": ...}

per data column (../data/bed/Exp2_0_Polymerase_II.bed.gz, ../data/bed/Exp1_0_Polymerase_II.bed.gz, etc.), what would I then need to make those nice TSS plots/heatmaps from example 1? I'd like to incorporate the creation of those graphs into my ChIP-seq caller with a --graph option (reminding people to cite your paper, of course).
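For concreteness, here's how I imagine those per-TSS arrays would be stacked for a heatmap (plain numpy; surely not metaseq's actual API):

```python
import numpy as np

# Hypothetical per-TSS binned counts for one data column.
tss_arrays = {
    "tss1": [0, 2, 1, 8],
    "tss2": [42, 50, 68, 74],
    "tss3": [5, 5, 9, 3],
}

# One row per TSS, one column per bin -- the shape that
# heatmap/average-profile plots work from.
matrix = np.array([tss_arrays[k] for k in sorted(tss_arrays)])

# The average TSS profile is then just a column-wise mean.
profile = matrix.mean(axis=0)
print(matrix.shape, profile.tolist())
```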

But if metaseq already has code that can work with the type of data shown in the two previous posts, that would be even better. It would have to pick out the bins from TSS regions by itself, though.

daler commented 8 years ago

Hi Endre -

Thanks for taking an interest.

First, some background to orient you... the main goal of metaseq is to get genomics data into Python data structures (mostly numpy.array and pandas.DataFrame). In the Example 1 case, the input is a BED file of TSSes defining windows to look within, plus a BAM file. A genomic_signal object is initialized simply by pointing to some interval-based genomic data (BAM, BED, bigBed, or bigWig; BAM in this example). Its local_coverage function is responsible for accepting an interval and returning a numpy.array representing the values for that interval (linearly interpolated into bins, if bins were specified). An array is built by querying the genomic data for each interval (here, each TSS provided in the BED file) in parallel. The specific way a genomic_signal object extracts data from the underlying file depends on its filetype_adapter.

So really the question is how to get your data format into a numpy.array. Or more generally, how to make this format useful for various downstream tools (not just limited to metaseq)?

If you have chrom, start, end fields as the first 3 columns, then you can use bedtools/pybedtools for interval manipulation and pysam with tabix for indexing. You can also create a separate bigWig file for each sample (you have to go bedGraph -> bigWig, extracting each sample with cut -f 1,2,3,4, cut -f 1,2,3,5, etc.):

chr1    10000   10050   1   3   4   3
chr1    10050   10100   0   2   1   2
chr1    10100   10150   1   4   4   2

Option 1: create one bigWig file for each column, and use metaseq and other downstream analysis tools (like deepTools) as-is.
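For option 1, splitting the table into per-sample bedGraph data could be sketched like this (column names are hypothetical; the final bedGraph -> bigWig step would use UCSC's bedGraphToBigWig and is not shown):

```python
import io

import pandas as pd

# Hypothetical binned table: chrom/start/end plus one column per sample.
TABLE = """\
chrom start end Exp2_PolII Exp1_PolII Exp2_Input Exp1_Input
chr1 10000 10050 1 3 4 3
chr1 10050 10100 0 2 1 2
chr1 10100 10150 1 4 4 2
"""

df = pd.read_csv(io.StringIO(TABLE), sep=r"\s+")

bedgraphs = {}
for sample in df.columns[3:]:
    # bedGraph is just chrom/start/end/value; one file per sample.
    bedgraphs[sample] = df[["chrom", "start", "end", sample]]
    # In practice, write each out and convert:
    #   bedgraphs[sample].to_csv(f"{sample}.bedGraph", sep="\t",
    #                            header=False, index=False)
    #   then run bedGraphToBigWig on the result.

print(sorted(bedgraphs))
```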

Option 2: a genomic_signal object could be initialized with this data file (bgzipped and tabix-indexed if needed). The object would also store the column you care about. Then, when local_coverage is given an interval, it does a tabix query to extract all lines intersecting that interval and, for each line, pulls out the column you care about. It would be important to call local_coverage with use_score=True to return the score rather than presence/absence (as is the case when using BAM).
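The core of that logic would look roughly like this -- a pure-Python stand-in where the in-memory list of rows plays the role of what pysam's TabixFile.fetch would return for the query interval (the function name and signature here are illustrative, not metaseq's actual API):

```python
import numpy as np

# Stand-in for the lines a tabix query over chr1:10000-10150 would return:
# (chrom, start, end, *per-sample scores)
rows = [
    ("chr1", 10000, 10050, 1, 3, 4, 3),
    ("chr1", 10050, 10100, 0, 2, 1, 2),
    ("chr1", 10100, 10150, 1, 4, 4, 2),
]

def local_coverage(rows, start, end, score_col, bin_size=50):
    """Return a score array for [start, end), use_score=True style."""
    out = np.zeros((end - start) // bin_size)
    for chrom, s, e, *scores in rows:
        # Drop each row's chosen score into the bins it overlaps.
        first = max(s, start)
        last = min(e, end)
        for pos in range(first, last, bin_size):
            out[(pos - start) // bin_size] = scores[score_col]
    return out

print(local_coverage(rows, 10000, 10150, score_col=0))
```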

In benchmarks, creating arrays from bigWig is substantially faster than creating them from tabixed files. My hunch is that the initial cost of bigWig creation will be offset by the performance gains in downstream processing.

Specifically for a TSS plot, you have to provide the TSSes that make up the rows of the array. It's these intervals that need strand information if you want to talk about upstream and downstream; you don't need strand in the underlying data file.
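For example, orienting a window around each TSS using strand is as simple as (a minimal sketch; tss_window is an illustrative helper, not part of metaseq):

```python
def tss_window(chrom, tss, strand, upstream=1000, downstream=1000):
    """Return (chrom, start, end) for a window around a TSS.

    Strand only flips which side counts as upstream; the underlying
    signal file needs no strand information.
    """
    if strand == "+":
        return chrom, tss - upstream, tss + downstream
    return chrom, tss - downstream, tss + upstream

# A minus-strand gene's "upstream" lies to the right of its TSS.
print(tss_window("chr1", 50000, "+", upstream=2000, downstream=500))
print(tss_window("chr1", 50000, "-", upstream=2000, downstream=500))
```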

Side note 1: if you end up digging through the code, the perhaps-odd arrangement is for parallel performance -- this was my first heavy use of multiprocessing in a module, and I had to organize things this way in order to pickle functions across processes. I'd like to find time to simplify it, but it's low priority at the moment.

Side note 2: I'd love to try out your peak caller (existing ones leave a lot to be desired, esp for broad domains). I can't find it in your repos though!

ghost commented 8 years ago

Thanks for the explanation.

Without reading your paper, just the docs, we got the impression that metaseq was for creating those two very pretty graphs (there are two of us, with different projects, looking into using metaseq for those TSS graphs). I guess there is a lot more to metaseq, then, and its main purpose is broader. Do I understand correctly that it is for reading and using genomic data in Python, speedily and easily (with many additional conveniences)?

It seems like creating bigWig files is the way to go then, especially since that format can also be used with other tools. I hadn't heard about deepTools, but it looks great.

The code looks good to me, but if you are redoing the parallelism, you might want to look into joblib. Its Parallel function has many conveniences, such as sane error messages and cleanup.
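For example (the worker function here is just a stand-in for real per-interval work):

```python
from joblib import Parallel, delayed

def bin_count(x):
    # Stand-in for per-chromosome work (e.g. binning reads).
    return x * x

# n_jobs=2 runs tasks in separate processes; errors propagate with
# readable tracebacks, and worker resources are cleaned up.
results = Parallel(n_jobs=2)(delayed(bin_count)(i) for i in range(5))
print(results)
```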

I'll send you an e-mail about the ChIP-seq peak caller.

daler commented 8 years ago

Right, it's for getting genomic data into Python. It's a library of tools to build your own visualizations, rather than a CLI to produce canned plots. Enough people ask about a CLI that I may add one, but deepTools is probably better for that kind of task.

You can just scroll through the figures in the paper for more examples -- they are all unmodified, straight from metaseq. There are more in the supplemental PDF, too.

I agree, joblib is the way to go. Memory.cache is really helpful for working with lots of large numpy arrays on disk, too.

ghost commented 8 years ago

Thanks. Meta is a good name; this is all a tad abstract.

ghost commented 8 years ago

Btw, there are a few quirks with Memory.cache that, in my view, make it less than ideal for inclusion in software published for others to use:

1) it uses a lot of storage space, 2) you cannot delete just parts of the cache, and 3) it is impossible to inspect the cache deeply.

You should consider telling users how much space the cache is currently taking and how to delete it. I regret not doing so in my software.
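Reporting the current cache size only needs a directory walk, e.g. (cache_size_bytes is my own hypothetical helper, not a joblib function):

```python
import os
import tempfile

def cache_size_bytes(cache_dir):
    """Total on-disk size of a cache directory (e.g. a joblib Memory location)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "blob.bin"), "wb") as f:
        f.write(b"\x00" * 1024)
    print(cache_size_bytes(d))  # 1024
```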

daler commented 8 years ago

Hmm, I haven't hit the space problem yet, and I've only been using it for simple cases. Good to know, thanks.