Open taylorreiter opened 2 years ago
Hello @taylorreiter!
The loop is very slow. Is there a better way to query refine.bio to retrieve download links? I don't think this approach will scale to all RNA-seq data in refine.bio.
Is it possible to only get salmontools-result.tar.gz and not also -multiqc* files?
This should be possible. I will add filename to the available API filters so you can filter explicitly for that.
After running through ~27 iterations, I received ServerError: The server encountered an issue. Do you know what might have caused this, if there is a way to avoid this, or the best way to keep the code running if this is going to happen frequently?
I think there are a couple of ways to improve that. The current default page size is 25, but you can get ~100 samples per request from the API relatively quickly. I will update here with a modified script that collects what you are hoping for.
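In the meantime, one way to keep a long loop alive through an occasional ServerError is to retry the failing call with a growing pause. This is only a minimal sketch: `with_retries` is a hypothetical helper and not part of pyrefinebio, and it assumes you pass in whichever exception class the client raises for server errors.

```python
import time


def with_retries(call, retry_on=(Exception,), attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call() and retry on the given exceptions with exponential backoff.

    Hypothetical helper for illustration; not part of pyrefinebio.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

Each API call inside the loop could then be wrapped in a zero-argument function or lambda and passed to `with_retries`, so a single failed request does not end the whole run.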
Are the S3 download links persistent? If not, how long do they last for? I was planning on producing the download links, saving them to a spreadsheet, and then automating their download from S3, but I wasn't sure if this is ill-advised.
The download_url link expires 7 days after it is created (the lifetime is defined in seconds from creation time).
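If you do save the links to a spreadsheet, it may help to store the time each link was retrieved and check it before downloading. A minimal sketch, assuming the retrieval time is a fair stand-in for the link's creation time; `link_is_usable` and the one-hour safety margin are hypothetical choices, not anything the API provides.

```python
from datetime import datetime, timedelta, timezone

# From the answer above: links expire 7 days after they are created.
LINK_LIFETIME = timedelta(days=7)


def link_is_usable(fetched_at, now=None, margin=timedelta(hours=1)):
    """Return True if a link retrieved at `fetched_at` should still be valid.

    `fetched_at` must be a timezone-aware datetime. The margin leaves
    some room for the download itself to finish before the link expires.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at < LINK_LIFETIME - margin
```

Links that fail this check would need to be requested again from the API rather than reused.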
Does this information already exist in a database or JSON somewhere? I'm curious if it would be more efficient to work with the database directly considering I'm hoping to download so many files.
The information exists in the database and I would be able to provide a CSV of everything. However, signed URLs would be harder to generate on your end.
So once I have added filename to the available filters for ComputedFile, you will be able to:
import pyrefinebio as pyrb

computed_files = pyrb.ComputedFile.search(filename="salmontools-results.tar.gz", is_qc=True, limit=1000)
for cf in computed_files:
    my_custom_handler(cf)  # placeholder for your own per-file logic
The above code might get throttled if you run more than 100 iterations. If so, you could pause for one second between API calls (each call fetches another 1000 computed files).
import time

computed_files = pyrb.ComputedFile.search(filename="salmontools-results.tar.gz", is_qc=True, limit=1000)
for cf_page in computed_files.pages:
    for cf in cf_page:
        my_custom_handler(cf)
    time.sleep(1)  # pause after each page to avoid being throttled
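For the spreadsheet workflow you described, `my_custom_handler` could be something like the sketch below. It is only an illustration: `make_csv_handler` is a hypothetical helper, and `id`, `filename`, and `download_url` are assumed to be attributes available on each computed file.

```python
import csv
from datetime import datetime, timezone


def make_csv_handler(writer):
    """Build a handler that appends one row per computed file to a csv.writer."""
    def handler(cf):
        # `id`, `filename`, and `download_url` are assumed attribute names.
        writer.writerow([
            cf.id,
            cf.filename,
            cf.download_url,
            # Record when the link was fetched, since links expire.
            datetime.now(timezone.utc).isoformat(),
        ])
    return handler
```

You would open the output file once, create `my_custom_handler = make_csv_handler(csv.writer(f))`, and then run the loop above unchanged.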
I will let you know here as soon as this is ready for you to try.
Also, one more thing worth mentioning: the Processor search method does not accept filtering at this time, and the name that is used is Salmontools.
processor_ids = [
    processor.id
    for processor in pyrb.Processor.search(limit=600)
    if processor.name == "Salmontools"
]
I would love to download ~all of the unmapped reads for RNA-seq samples processed by refine.bio, with the following exceptions:
I tried out downloading the data using this notebook: https://github.com/taylorreiter/2022-refinebio-unmapped/blob/main/notebooks/try_pyrefinebio.ipynb
The meat of the code is reproduced below:
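Some challenges I encountered:
- The loop is very slow. Is there a better way to query refine.bio to retrieve download links? I don't think this approach will scale to all RNA-seq data in refine.bio.
- Is it possible to only get salmontools-result.tar.gz and not also -multiqc* files?
- After running through ~27 iterations, I received ServerError: The server encountered an issue. Do you know what might have caused this, if there is a way to avoid this, or the best way to keep the code running if this is going to happen frequently?
Some questions I have:
- Are the S3 download links persistent? If not, how long do they last for? I was planning on producing the download links, saving them to a spreadsheet, and then automating their download from S3, but I wasn't sure if this is ill-advised.
- Does this information already exist in a database or JSON somewhere? I'm curious if it would be more efficient to work with the database directly considering I'm hoping to download so many files.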
Any insight you have would be greatly appreciated!