AlexsLemonade / refinebio-py

A python client for the refine.bio API.
BSD 3-Clause "New" or "Revised" License

Downloading unmapped reads contained in `salmontools-results.tar.gz` from ~all RNA seq samples on refine.bio #79

Open · taylorreiter opened this issue 2 years ago

taylorreiter commented 2 years ago

I would love to download ~all of the unmapped reads for RNA-seq samples processed by refine.bio, with the following exceptions:

I trialed downloading the data using this notebook: https://github.com/taylorreiter/2022-refinebio-unmapped/blob/main/notebooks/try_pyrefinebio.ipynb The meat of the code is reproduced below:

import pyrefinebio as pyrb

# Collect the ids of processors named "SALMONTOOLS"
processor_ids = [processor.id for processor in pyrb.Processor.search(name="SALMONTOOLS")]
# All RNA-seq samples in refine.bio
samples = pyrb.Sample.search(technology="RNA-SEQ")

filepaths = []
accessions = []
computed_filenames = []
downloadurls = []

# Trial run over the first ~100 samples
for sample in samples[0:99]:
    for result in sample.results:
        # Keep only results produced by a salmontools processor
        if result.processor.id in processor_ids:
            computed_files = pyrb.ComputedFile.search(result__id=result.id)

            for computed_file in computed_files:
                if computed_file.is_qc:
                    filepaths.append(sample.accession_code + "-" + computed_file.filename)
                    accessions.append(sample.accession_code)
                    computed_filenames.append(computed_file.filename)
                    downloadurls.append(computed_file.download_url)

Some challenges I encountered:

Some questions I have:

Any insight you have would be greatly appreciated!

davidsmejia commented 2 years ago

Hello @taylorreiter!

> The loop is very slow. Is there a better way to query refine.bio to retrieve download links? I don't think this approach will scale to all RNA-seq data in refine.bio. Is it possible to only get `salmontools-result.tar.gz` and not also `-multiqc*` files?

This should be possible. I will add filename to the available API filters so you can filter explicitly for that.

> After running through ~27 iterations, I received `ServerError: The server encountered an issue.` Do you know what might have caused this, if there is a way to avoid it, or the best way to keep the code running if this is going to happen frequently?

I think there are a couple of ways to improve that. The current default page size is 25, but you can get ~100 samples from the API relatively quickly. I will update here with a modified script that collects what you are hoping for.
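
In the meantime, one way to keep a long loop alive through transient server errors is a retry wrapper around the flaky call. A minimal sketch (the `with_retries` helper is hypothetical, and ideally you would catch pyrefinebio's ServerError class rather than the broad `Exception`):

import time

def with_retries(fn, attempts=5, base_delay=2):
    # Retry fn() with exponential backoff so a transient
    # server error doesn't kill a long-running collection loop.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # ideally narrow this to pyrefinebio's ServerError
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Inside the sample loop above, wrap the call that was failing:
computed_files = with_retries(lambda: pyrb.ComputedFile.search(result__id=result.id))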

> Are the S3 download links persistent? If not, how long do they last? I was planning on producing the download links, saving them to a spreadsheet, and then automating their download from S3, but I wasn't sure if this is ill-advised.

The `download_url` link expires after 7 days (defined as seconds from creation time).
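
So if you do save the links to a spreadsheet, rows older than 7 days can be refreshed by re-fetching each computed file by id. A minimal sketch, assuming `ComputedFile.get(id)` returns a freshly signed `download_url` in your pyrefinebio version (the CSV layout here is hypothetical):

import csv
import urllib.request

import pyrefinebio as pyrb

# Hypothetical CSV with columns: computed_file_id, filename
with open("computed_files.csv") as handle:
    for row in csv.DictReader(handle):
        # Re-fetch the record so the signed URL is fresh,
        # since saved URLs expire after 7 days.
        cf = pyrb.ComputedFile.get(row["computed_file_id"])
        urllib.request.urlretrieve(cf.download_url, row["filename"])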

> Does this information already exist in a database or JSON somewhere? I'm curious if it would be more efficient to work with the database directly considering I'm hoping to download so many files.

The information exists in the database, and I would be able to provide a CSV of everything. However, the signed URLs would be harder to generate on your end.

So once I am able to add `filename` to the available filters for ComputedFile, you will be able to do:

# Once the filename filter is available, one search pulls every matching file
computed_files = pyrb.ComputedFile.search(filename="salmontools-results.tar.gz", is_qc=True, limit=1000)

for cf in computed_files:
    my_custom_handler(cf)
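
(`my_custom_handler` above is a placeholder; a hypothetical version that just records each file's metadata for a later batch download could be:)

import csv

# Hypothetical handler: append id, filename, and signed URL
# to a CSV for later batch download.
def my_custom_handler(cf, path="salmontools_files.csv"):
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerow([cf.id, cf.filename, cf.download_url])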

The above code might get throttled if you are doing > 100 iterations. If so, you could pause for one second between API calls (each of which fetches another 1000 computed files):

import time

computed_files = pyrb.ComputedFile.search(filename="salmontools-results.tar.gz", is_qc=True, limit=1000)

for cf_page in computed_files.pages:
    for cf in cf_page:
        my_custom_handler(cf)
    time.sleep(1)  # pause between pages to avoid throttling

I will let you know here as soon as this is ready for you to try.


Also, one more thing worth mentioning: the Processor search method does not accept filtering at this time, and the processor name to match is Salmontools (not SALMONTOOLS).

# Server-side name filtering isn't supported yet, so filter client-side
processor_ids = [
    processor.id
    for processor in pyrb.Processor.search(limit=600)
    if processor.name == "Salmontools"
]
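
If the exact casing is ever in doubt, a quick sketch to list the distinct processor names (assuming `limit=600` covers them all):

import pyrefinebio as pyrb

# Print every distinct processor name to confirm the exact casing
names = {processor.name for processor in pyrb.Processor.search(limit=600)}
print(sorted(names))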