It seems like there are two readers that it might be possible to use: 1) the deseq reader or 2) the dataframe reader.
Can either of these be used as-is? If not, what do I need to change about my data to use them?
I was mistaken about the possibility of creating bed files; my binned files do not contain strand info. So using one of the two readers above, or creating my own reader, looks like the most straightforward option.
Better example of my data than the one shown above:
```
Chromosome  Bin    ../data/bed/Exp2_0_Polymerase_II.bed.gz  ../data/bed/Exp1_0_Polymerase_II.bed.gz  ../data/bed/Exp2_0h_Input.bed.gz  ../data/bed/Exp1_0h_Input.bed.gz
chr1        10000  1  3  4  3
chr1        10050  0  2  1  2
chr1        10100  1  4  4  2
chr1        10150  3  2  4  1
chr1        10200  5  2  2  3
chr1        10250  3  0  1  2
chr1        10300  3  1  2  1
chr1        10350  1  4  1  4
chr1        10450  2  4  2  2
```
It would not surprise me if this is very close to one of the formats metaseq already knows how to parse.
Come to think of it, metaseq would at least need some gene info for producing the heatmaps.
I'm guessing metaseq needs one array per input file per TSS. So, for example, for Sample1 you'd have `number_of_tsses` arrays, where each array is `tss_region_length / bin_length` long.
If I create a dict like
`{"tss1": [0, 2, 1, 8...], "tss2": [42, 50, 68, 74...], "tss3": ...}`
per data column (`../data/bed/Exp2_0_Polymerase_II.bed.gz`, `../data/bed/Exp1_0_Polymerase_II.bed.gz`, etc.), what would I then need to make those nice TSS plots/heatmaps from Example 1? I think I'd like to incorporate the creation of those graphs into my ChIP-seq caller behind a `--graph` option (reminding people to cite your paper, of course).
But if metaseq has code that can work with the type of data shown in the two previous posts, that would be even better. It would have to pick out the bins from TSS regions by itself, though.
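To be concrete, here is a minimal sketch of the kind of thing I imagine doing with such a dict (everything below is made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-column dict: one binned-count vector per TSS.
counts_by_tss = {
    "tss1": [0, 2, 1, 8],
    "tss2": [42, 50, 68, 74],
    "tss3": [1, 0, 2, 1],
}

# Stack into a (number_of_tsses x bins) matrix, one row per TSS.
arr = np.array([counts_by_tss[k] for k in sorted(counts_by_tss)])

# A bare-bones heatmap; presumably metaseq adds the sorting and the
# average-signal panel seen in Example 1.
plt.imshow(arr, aspect="auto", interpolation="nearest")
plt.xlabel("bin")
plt.ylabel("TSS")
plt.savefig("tss_heatmap.png")
```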
Hi Endre -
Thanks for taking an interest.
First some background to orient you... the main goal of metaseq is to get genomics data into Python data structures (mostly `numpy.array` and `pandas.DataFrame`). In the Example 1 case, the input is a BED file of TSSes (the windows to look within) and a BAM file. The `genomic_signal` object is initialized simply by pointing it at some interval-based genomic data (BAM, BED, bigBed, or bigWig; BAM in this example). Its `local_coverage` function is responsible for accepting an interval and returning a `numpy.array` representing the values for that interval (linearly interpolated into bins, if bins were also specified). An array is built by querying the genomic data for each interval (here, a TSS provided in the BED file) in parallel. The specific way a `genomic_signal` object extracts data from the underlying file depends on its `filetype_adapter`.
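To make that concrete, a condensed sketch of the Example 1 workflow (file names are just placeholders):

```python
import metaseq
import pybedtools

# Point a genomic_signal at interval-based data; 'bam' selects the
# filetype adapter.
ip_signal = metaseq.genomic_signal('ip.bam', 'bam')

# One window per TSS; these become the rows of the array.
tsses_1kb = pybedtools.BedTool('tsses-1kb.bed')

# Build a (number_of_tsses x 100) array in parallel; each row is one
# interval's signal, binned into 100 values.
arr = ip_signal.array(tsses_1kb, bins=100, processes=8)
```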
So really the question is how to get your data format into a numpy.array. Or more generally, how to make this format useful for various downstream tools (not just limited to metaseq)?
If you have chrom, start, end fields as the first 3 columns, then you can use bedtools/pybedtools for interval manipulation and pysam with tabix for indexing. You can also create separate bigWig files for each sample (you have to go bedGraph -> bigWig, with `cut -f 1,2,3,4`, `cut -f 1,2,3,5`, etc. pulling out each sample's column). That is, your table with explicit end coordinates:
```
chr1 10000 10050 1 3 4 3
chr1 10050 10100 0 2 1 2
chr1 10100 10150 1 4 4 2
```
Option 1: create one bigWig file for each column, and use metaseq and other downstream analysis tools (like deepTools) as-is.
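For Option 1, splitting out the columns could look something like this sketch (file names are hypothetical; it assumes the chrom/start/end layout above, with UCSC's bedGraphToBigWig doing the final conversion):

```python
import pandas as pd

# Binned counts: chrom, start, end, then one count column per sample.
df = pd.read_csv('binned_counts.txt', sep='\t')
chrom, start, end = df.columns[:3]

# Write one bedGraph per sample column; each can then be converted with
# bedGraphToBigWig (which also needs a chrom.sizes file).
for sample in df.columns[3:]:
    out = sample.replace('/', '_') + '.bedgraph'
    df[[chrom, start, end, sample]].to_csv(out, sep='\t', header=False, index=False)
```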
Option 2: A `genomic_signal` object could be initialized with this data file (bgzipped and tabix-indexed if needed). The object would also store the column you care about. Then, when its `local_coverage` is provided with an interval, it does a tabix query to extract all lines intersecting that interval and, for each line, pulls out the column you care about.
It would be important to call `local_coverage` with the `use_score=True` argument to return the score rather than presence/absence (as is the case when using BAM).
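A rough sketch of what that query could look like, using pysam's TabixFile (the helper and its arguments are hypothetical, not metaseq API; it assumes fixed-size bins and a 0-based column index):

```python
import numpy as np
import pysam

def column_coverage(fn, chrom, start, end, column, bin_size=50):
    """Binned scores for one sample column of a bgzipped,
    tabix-indexed table of fixed-size bins (hypothetical helper)."""
    scores = np.zeros((end - start) // bin_size)
    tbx = pysam.TabixFile(fn)
    for line in tbx.fetch(chrom, start, end):
        fields = line.split('\t')
        bin_start = int(fields[1])
        idx = (bin_start - start) // bin_size
        if 0 <= idx < len(scores):
            # Pull out just the column you care about.
            scores[idx] = float(fields[column])
    tbx.close()
    return scores

# e.g. column=3 for the first sample column after chrom/start/end:
# signal = column_coverage('binned_counts.txt.gz', 'chr1', 10000, 11000, 3)
```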
In benchmarks, creating arrays from bigWig files is substantially faster than creating them from tabixed files. My hunch is that the initial cost of bigWig creation will likely be offset by performance gains in downstream processing.
Specifically for a TSS plot, you would have to provide the TSSes to represent each row of the array. It's these intervals that need strand if you want to talk about upstream and downstream. You don't need strand in the underlying data file.
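For example, building stranded 1-kb windows around TSSes with pybedtools might look like this (assuming a stranded BED of single-bp TSS positions; file names made up):

```python
import pybedtools

# Stranded, 1-bp TSS intervals; strand lives here, not in the signal file.
tsses = pybedtools.BedTool('tsses.bed')

# Extend 1 kb on each side of every TSS (here using hg19 chrom sizes).
tsses_1kb = tsses.slop(b=1000, genome='hg19')
tsses_1kb.saveas('tsses-1kb.bed')
```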
Side note 1: If you end up digging through the code, its perhaps-odd arrangement is for parallel performance -- this was my first heavy use of multiprocessing in a module, and I had to organize things this way in order to pickle functions across processes. I'd like to find time to simplify it if possible, but it's low priority at the moment.
Side note 2: I'd love to try out your peak caller (existing ones leave a lot to be desired, esp for broad domains). I can't find it in your repos though!
Thanks for the explanation.
Without reading your paper, just the docs, we got the impression that it was for creating those two very pretty graphs (there are two of us, with different projects, looking into using metaseq for those TSS graphs). I guess there is a lot more to metaseq, then, and its main purpose is broader. Do I understand correctly that it is for reading and using genomic data in Python, speedily and easily (with many additional conveniences)?
It seems like creating bigWig files is the way to go, then, especially since that format can be used with other tools. I hadn't heard of deepTools, but it looks great.
The code looks good to me, but if you are redoing the parallelism, you might want to look into joblib. Its `Parallel` helper has many conveniences, such as sane error messages and cleanup.
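Something like this minimal sketch (the function and inputs are made up for illustration):

```python
from joblib import Parallel, delayed

def coverage_for_interval(interval):
    # Stand-in for the per-interval work metaseq parallelizes.
    start, end = interval
    return end - start

intervals = [(10000, 11000), (20000, 21000), (30000, 31000)]

# Runs the calls across 4 processes; failures come back with readable
# tracebacks and worker cleanup is handled for you.
results = Parallel(n_jobs=4)(delayed(coverage_for_interval)(iv) for iv in intervals)
```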
I'll send you an e-mail about the ChIP-seq peak caller.
Right, it's for getting genomic data into Python. It's a library of tools to build your own visualizations, rather than a CLI to produce canned plots. Enough people ask about a CLI that I may add one, but deepTools is probably better for that kind of task.
You can just scroll through the figures in the paper for more examples -- they are all unmodified, straight from metaseq. There are more in the supplemental PDF, too.
I agree, `joblib` is the way to go. `Memory.cache` is really helpful for working with lots of large numpy arrays on disk, too.
Thanks. 'Meta' is a good name for it; this is all a tad abstract.
Btw, there are a few quirks with `Memory.cache` that in my view make it less than ideal for inclusion in software published for others to use:
1) it uses a lot of storage space
2) you cannot delete just parts of the cache
3) it is impossible to inspect the cache deeply
You should consider telling users how much space the cache is currently taking and how to delete it. I regret not doing so in my software.
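For example, something along these lines (the cache directory name is made up; as far as I know joblib can only wipe the whole cache):

```python
import os
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)

def cache_size_bytes(location='./joblib_cache'):
    # Total up the sizes of every file under the cache directory.
    total = 0
    for root, dirs, files in os.walk(location):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

print('cache is using %.1f MB' % (cache_size_bytes() / 1e6))

# All-or-nothing: joblib has no way to delete just parts of the cache.
memory.clear()
```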
Hmm, I haven't hit the space problem yet, and I've only been using it for simple cases. Good to know, thanks.
I have written a new, fast ChIP-seq caller in Python for medium/diffuse areas. It uses a binning strategy, and one of the files it produces contains the island counts:
I'd love to make it easy to use metaseq with my ChIP-seq caller to produce various graphs. With the data above I could regenerate four bed files* so that using metaseq would be straightforward, but that would be needlessly time-consuming. Therefore I'll look into creating an adapter for data like the above.
What I want is a file adapter that iterates over the columns of the dataframe and returns one signal object for each.
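Roughly this shape, as a sketch (ColumnSignal is entirely hypothetical; it just marks where a real metaseq-compatible object would go):

```python
import pandas as pd

class ColumnSignal(object):
    """Hypothetical stand-in for a signal object tied to one sample column."""
    def __init__(self, df, column):
        self.df = df
        self.column = column

def signals_from_table(fn):
    # One signal-like object per sample column (after Chromosome and Bin).
    df = pd.read_csv(fn, sep='\t')
    return {col: ColumnSignal(df, col) for col in df.columns[2:]}
```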
What do the `genomic_signal` objects look like? I imagine they are pretty similar to my table above... I could not find them in the test suite: https://github.com/daler/metaseq/blob/master/metaseq/test/test.py
I guess I could play around with metaseq for half a day and find out, but I am not used to reading OO code, so that would be hard for me. I would love some comments on the strategy for / feasibility of creating `genomic_signal` objects for data like the above, if you have the time.
*For sample 1 I'd make a bed with the following: