YeoLab / flotilla

Reproducible machine learning analysis of gene expression and alternative splicing data
http://yeolab.github.io/flotilla/docs
BSD 3-Clause "New" or "Revised" License
121 stars 26 forks source link

Support arbitrary genome location data for variants, intervals, RNA editing, etc #145

Open olgabot opened 9 years ago

olgabot commented 9 years ago

E.g. have the feature ids be pybedtools.Interval objects or something. Or some other kind of multiindex?

I'm thinking of this for methylation data and binding data.

@gpratt @mlovci any input?

olgabot commented 9 years ago

oh and editing, @boyko-

olgabot commented 9 years ago

@daler @yarden you have experience with these arbitrary genome location formats, what do you think?

olgabot commented 9 years ago

i.e. what do you think of using gffutils/pybedtools vs pandas.MultiIndex for the indices of a (samples x features) matrix?

gpratt commented 9 years ago

The majority of questions your asking here will need intersections / unions / bedtools like functionality. The concept of indexing is difficult to integrate. Making it easy to do round trips between pybedtools / pandas would be optimal. Ryan has it half way there with the to_dataframe function. We just need to subclass pandas for an easy conversion back.

Gabriel Pratt Bioinformatics Graduate Student, Yeo Lab University of California San Diego

On Mon, Nov 10, 2014 at 3:07 PM, Olga Botvinnik notifications@github.com wrote:

i.e. what do you think of using gffutils/pybedtools vs pandas.MultiIndex for the indices of a (samples x features) matrix?

— Reply to this email directly or view it on GitHub https://github.com/YeoLab/flotilla/issues/145#issuecomment-62473136.

daler commented 9 years ago

Yeah, I guess it depends on what you want to do in the end. pandas.MultiIndex gets you indexing for free if all you care about is essentially a tuple lookup. But like Gabriel says if you're after interval ops then pybedtools would be the way to go.

This is sort of the problem I was trying to solve over in metaseq. In that case, the pandas.DataFrame index is feature ID, and any other information you want (e.g., genomic coords) is a lookup in the gffutils sqlite database. For example,

def generator():
    for gene_id in df.index:
        yield gffutils.asinterval(db[gene_id])
features = pybedtools.BedTool(generator())

But that assumes GFF/GTF, and here you're asking about arbitrary formats like BED, VCF, etc. In that case the pandas/pybedtools interconversion isn't bad:

a = pybedtools.example_bedtool('a.bed')
d = a.to_dataframe()
fn = pybedtools.BedTool._tmp()
d.to_csv(fn, sep='\t', header=False, index=False)
a2 = pybedtools.BedTool(fn)

Then it's a matter of figuring out if you want to store all the fields in the interval file as part of the multiindex, or just chrom/start/stop.

olgabot commented 9 years ago

@gpratt , @daler thanks for your input. This gets even more complicated with spicing events, which often involve multiple intervals. I think ultimately interval operations is what we want, and metaseq has a lot of that framework already built in, so I think we should work together in the future :)

The intervals get even more complicated with splicing events, which often involve multiple intervals as defining one event. How would we do this using metaseq or pybedtools?

daler commented 9 years ago

It really depends on how events are stored and what you want to do with them.

If you had a splice event with an ID like gene1|exon1:exon3 and exons were defined in a gffutils db, you could use the metaseq strategy out-of-the-box to do the lookup and get intervals.

On the other hand if your splice events were stored as IDs like chr1:1-100|chr1:300-400 then they are already in interval format and there's no need for an ID lookup. So you would use the pybedtools strategy.

In either case you would need to decide what exactly you wanted to do with the intervals. Do you have a concrete example in mind?

olgabot commented 9 years ago

It sounds like either way, there needs to be a gffutils type backend for gene id lookups, because genes are still going to have their gene ids, and we'll need to match up arbitrary intervals with nearby genes. Sounds like we'd also need to use pybedtools, for example, to check for whether the upstream 500nt of a particular subset of splicing events overlap with the binding data of some RNA binding protein.

The current example I have in mind is filtering splicing events on the gene expression, which right now requires a mapping of that splicing event ID (in the chr1:1-100@chr1:200-300@chr1:400-500 format you mentioned above) to an ensembl gene id, and this ensembl gene name must be the column names of Study.expression.data. I'm working on this feature right now and it is quite fragile as not all splicing events have gene ids in my database, and relies on a lot of input from the user for this mapping.

mlovci commented 9 years ago

@olgabot I think that this problem might be better solved by putting your gene IDs along with the feature metadata. Running gffutils all the time is expensive.

We want to be able to represent the genome in a way that's good for ML and more basic than just this splicing problem.

For example, if it's SNPs (vcf) the data should be samples X nucleotide calls (A, T, C, G encoded -> 1-4?), but if it's coverage data (bam/bed/....) it can be encoded as a samples X sum (or some other user-defined function) over possibly a sliding window or in the case of your splicing data, user-defined bins. I think @daler's suggestion of picking a use-case is important. Can't engineer for everything at once.

Maybe @gpratt knows some good visualizations for different types of genomic ranges that are frequently performed?

mlovci commented 9 years ago

Also, there's good viz and use patterns here: https://github.com/fidelram/deepTools

yarden commented 9 years ago

My solution to these types of issues has been to encode the gene_id into the GFF and propagate it to all the child nodes. GFF is difficult to work with in part because it represents a hierarchy but does not give you a single identifier that links parent and child nodes. I implemented something in gffutils that "sanitizes" a GFF by propagating the gene ID through the gene, mRNA and exon entries.

sanitizing a GFF: http://pythonhosted.org/gffutils/autodocs/gffutils.helpers.sanitize_gff_db.html

overlapping a GFF with a gene table to get gene ids: https://github.com/yarden/rnaseqlib/blob/clip/rnaseqlib/gff/gff_annotate_events.py

This makes the files greppable by gene id and simplifies downstream analyses.

On Nov 12, 2014, at 2:45 PM, Olga Botvinnik notifications@github.com wrote:

It sounds like either way, there needs to be a gffutils type backend for gene id lookups, because genes are still going to have their gene ids, and we'll need to match up arbitrary intervals with nearby genes. Sounds like we'd also need to use pybedtools, for example, to check for whether the upstream 500nt of a particular subset of splicing events overlap with the binding data of some RNA binding protein.

The current example I have in mind is filtering splicing events on the gene expression, which right now requires a mapping of that splicing event ID (in the chr1:1-100@chr1:200-300@chr1:400-500 format you mentioned above) to an ensembl gene id, and this ensembl gene name must be the column names of Study.expression.data. I'm working on this feature right now and it is quite fragile as not all splicing events have gene ids in my database, and relies on a lot of input from the user for this mapping.

— Reply to this email directly or view it on GitHub.