AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Make processed samples before combining available #2236

Closed aaronkw closed 4 years ago

aaronkw commented 4 years ago

Context

We're interested in using the compendia data from refinebio for making functional networks, which requires generating gene correlations from individual datasets.

Problem or idea

For our needs, we don't want values of missing genes to be imputed, or the quantile normalization across samples.

Solution or next step

Would it be possible to make the compendia data available before the 'Combined Matrix' step? That is, microarray and RNA-seq samples that are only individually processed, which can then be grouped into datasets (would be awesome if we could download them grouped by dataset :)).

kurtwheeler commented 4 years ago

Hi Aaron!

This sounds exciting! I think that we can get you what you want without too much effort, but not zero effort.

So what you're actually looking for sounds a lot like our standard datasets aggregated by experiment. I created one such dataset here. It has one microarray experiment and one RNA-Seq experiment in it. You can download it and see if that's more or less what you're looking for. (Depending on when you go check, you may have to regenerate it.)

If it is, then there are only a few changes we'll need to make:

Does that all make sense? Does the dataset I linked you to look like it has the right structure?

We don't have any spare dev resources at the moment to make those changes, but if you were able to implement the above we could run that job for you and make the data available.

aaronkw commented 4 years ago

Thanks for the response! The linked dataset looks like what we'd like to have. We'll take a look at adding a new command to generate the new compendia.

Not sure if this alternative is actually easier, but is it possible to just download all of the samples (and we'll handle the aggregation steps)? It was unclear from the samples response if this was possible (the computed_files field did not lead to an object we could retrieve from the computed_files endpoint).

kurtwheeler commented 4 years ago

Oh snap, if that works for you then it works for me. It should be pretty straightforward to do actually.

So it sounds like you already found the samples.computed_files field. It does actually lead to computed files, just not at https://api.refine.bio/v1/computed_files/<id>; instead you have to pass the id as a query parameter, like https://api.refine.bio/v1/computed_files/?id=<id>. I think we have an issue about this somewhere; it really should just be /<id>.
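As a sketch of that workaround (using Python's `requests`; the `results` key assumes the API's standard paginated response shape, so verify against an actual response):

```python
import requests

HOST = "https://api.refine.bio"

def computed_file_url(file_id):
    # The detail route /v1/computed_files/<id> does not resolve, so
    # build the list route with the id passed as a query parameter.
    return f"{HOST}/v1/computed_files/?id={file_id}"

def get_computed_file(file_id):
    # Fetch the filtered list and return the single matching record.
    # Assumes the usual paginated shape with a "results" array.
    response = requests.get(computed_file_url(file_id))
    response.raise_for_status()
    return response.json()["results"][0]
```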

Now, there are going to be a lot of samples and computed files that you don't want: we weren't able to process everything, and we also have some computed files that exist for QA or other purposes. I'd recommend filtering the samples endpoint with ?is_processed=True and only using computed files where is_smashable is True (this is a field we use to denote that the file is valid input for our smasher jobs, which are the jobs that smash datasets together). A sample may have multiple computed files with is_smashable=True if we had to reprocess it for some reason; in that case, use the one with the most recent created_at field.
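The selection rule above (keep only smashable files, then take the newest) can be sketched as a small helper; the dict keys mirror the API fields named in this comment, and the helper name is illustrative:

```python
def pick_smashable_file(computed_files):
    """Pick the computed file to use for a sample, or None.

    A sample may carry several computed files (QA output, reprocessed
    runs), so keep only the is_smashable ones and, if there are
    several, take the one with the most recent created_at.
    """
    smashable = [f for f in computed_files if f.get("is_smashable")]
    if not smashable:
        return None
    # ISO-8601 timestamps sort correctly as plain strings.
    return max(smashable, key=lambda f: f["created_at"])
```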

Finally, when you identify the correct computed file for each sample, you'll get a response containing, among other things, an s3_url. This URL won't work, because we need people to agree to our terms and conditions before we can give them data we processed. You'll need to obtain an API token with something like:

import json

import requests

host = 'https://api.refine.bio'

def create_token():
    """Create a new API token and activate it."""
    response = requests.post(host + '/v1/token/')
    token_id = response.json()['id']
    response = requests.put(host + '/v1/token/' + token_id + '/', json.dumps(
        {'is_activated': True}), headers={'Content-Type': 'application/json'})
    return token_id

(Note that running that code is effectively agreeing to our terms and conditions.)

Once you've got an API token, you can provide it as the API-KEY header on your HTTP requests, and the computed_files endpoint responses will then also contain a download_url field you can use to download the file!
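Putting the token to use might look like this (a sketch; `get_download_url` and its error handling are illustrative rather than part of the API, and the `results` key assumes the standard paginated response shape):

```python
import requests

HOST = "https://api.refine.bio"

def api_key_headers(token_id):
    # The activated token goes in the API-KEY header; with it present,
    # computed_files responses also include a download_url field.
    return {"API-KEY": token_id}

def get_download_url(file_id, token_id):
    # Look up one computed file (id as a query parameter, per the note
    # above) and return its signed download link.
    response = requests.get(
        HOST + "/v1/computed_files/",
        params={"id": file_id},
        headers=api_key_headers(token_id),
    )
    response.raise_for_status()
    return response.json()["results"][0]["download_url"]
```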

kurtwheeler commented 4 years ago

Oh also, this comment might be helpful as well: https://github.com/AlexsLemonade/refinebio/issues/1980#issuecomment-591589350

It links to some helper functions @arielsvn made for someone and is where I got create_token from.