cgreene opened this issue 4 years ago
We should have a command for this, rerun_salmon_old_samples, but I couldn't find any record of when we ran it. Looks like we have 530 experiments with mismatched salmon versions, with ~25k unprocessed samples still.
I'm going to wait until #2210 can be deployed to test this command and try to see why it didn't work for those experiments.
Retried sample DRR013807 with the downloader job id=7097213. We'll need to investigate why it didn't create any processor jobs afterward, even though the downloader job completed successfully.
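A minimal way to confirm this from a Django shell (Sample, original_files, and processor_jobs are the names already used elsewhere in this thread; treat the rest as a sketch):

from data_refinery_common.models import Sample

# Count the processor jobs attached to each original file of the sample;
# zero counts would confirm that nothing was created after the downloader job.
sample = Sample.objects.get(accession_code="DRR013807")
for original_file in sample.original_files.all():
    print(original_file.id, original_file.processor_jobs.count())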
Did some debugging of this, but still haven't found the problem. I manually set is_downloaded=False on the OriginalFile associated with that sample, and now the downloader jobs are failing with the error:
Was told to redownload file(s) that are already downloaded!
This is raised at https://github.com/AlexsLemonade/refinebio/blob/dev/workers/data_refinery_workers/downloaders/utils.py#L76-L90. However, original_file.needs_downloading() is True, so I don't yet understand why we're hitting that condition.
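For reference, the manual change described above amounts to roughly this in a Django shell (a sketch using only the field and method names already mentioned in this comment):

from data_refinery_common.models import Sample

# Flag the file as not downloaded so the downloader job should pick it up again.
original_file = Sample.objects.get(accession_code="DRR013807").original_files.first()
original_file.is_downloaded = False
original_file.save()

# This returns True, yet the job still raises
# "Was told to redownload file(s) that are already downloaded!"
print(original_file.needs_downloading())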
Just confirmed that #2218 was successful in fixing the downloader jobs; salmon succeeded on sample DRR013807.
I think I also found another issue with the command rerun_salmon_old_samples: it looks like it only queues samples where we had already run salmon successfully. So there are still 32 samples where salmon has never succeeded that we should be able to requeue.
Examples of these include: DRR062866, DRR062867, DRR062868, DRR062869, DRR062870
I'm going to try queueing these samples to see if any of them succeed before trying all the others in the same situation.
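For the record, a handful of accessions can be queued with the same helper used later in this thread (a sketch, assuming it is run in a Django shell on the foreman):

from data_refinery_common.models import Sample
from data_refinery_foreman.foreman.management.commands.retry_samples import requeue_samples

# Requeue just the five accessions listed above before trying the full set.
accessions = ["DRR062866", "DRR062867", "DRR062868", "DRR062869", "DRR062870"]
requeue_samples(Sample.objects.filter(accession_code__in=accessions))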
The volume ran out of space, and it looks like we might need to add more RAM for the SRA jobs. For sample DRR062899 all jobs were OOM-killed. The associated original file is ~13 GB.
I'm going to scale down and continue looking into this on monday.
Just retried sample DRR062899 and salmon ran successfully on it. Upon further investigation, it looks like the problem with these is that we are running out of space. There are only 100 GB on the instance, and it was filled by 5 samples:
ubuntu@ip:/var/ebs$ du -hs *
50G DRP001150
3.4G processor_job_29572380_index
15G processor_job_29572383
13G processor_job_29572386
44K TRANSCRIPTOME_INDEX
4.0K VOLUME_INDEX
@kurtwheeler what would you recommend to process these samples? We could either increase the size of the disk or reduce the number of jobs that can run at the same time.
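Either way, checking how much headroom the volume has before queueing more jobs is straightforward (a sketch; /var/ebs is the mount from the output above):

import shutil

# Report free space on the EBS mount the jobs write to.
total, used, free = shutil.disk_usage("/var/ebs")
print("free: {:.1f} GB of {:.1f} GB".format(free / 1e9, total / 1e9))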
Looks like #2222 and #2223 made things better, but there are still some SRA jobs that are failing, either OOM-killed or with the error "[Errno 12] Cannot allocate memory".
Sample DRR062880 failed with these errors, but then I retried it and it succeeded on the first downloader job:
Still looking into this, but running out of ideas. These jobs work sometimes and others they don't. When they fail we get the following error in the logs:
data_refinery_workers.downloaders.sra ERROR [downloader_job: 7098044]: Exception caught while downloading file.
Traceback (most recent call last):
File "/home/user/data_refinery_workers/downloaders/sra.py", line 108, in _download_file_http
shutil.copyfileobj(r.raw, f)
File "/usr/lib/python3.5/shutil.py", line 76, in copyfileobj
fdst.write(buf)
OSError: [Errno 12] Cannot allocate memory
And we have faced similar issues before: https://github.com/AlexsLemonade/refinebio/issues/572 and https://github.com/AlexsLemonade/refinebio/pull/1437 (these issues have started to appear in my Google searches).
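For anyone following along, the failing call in the traceback is a streaming copy of the HTTP response into the target file, roughly this pattern (a reconstruction for illustration, not the actual sra.py code):

import shutil
import requests

def _download_file_http(download_url, target_file_path):
    # Stream the response body straight to disk; the OSError above is raised
    # from inside shutil.copyfileobj when the write cannot allocate memory.
    with requests.get(download_url, stream=True) as r:
        with open(target_file_path, "wb") as f:
            shutil.copyfileobj(r.raw, f)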
Either RAM or disk. I'm guessing disk. Is there available space?
Yes, we have space available:
$ df -h | grep /var/ebs
/dev/xvdf 2.0T 392G 1.5T 21% /var/ebs
We also found this https://stackoverflow.com/a/45620524/763705 which suggests that hitting the max file descriptor limit might cause that error as well.
Ahh - I've hit that before with many small files. Seems unusual in this case but it's possible to check that number (I think it's the number of inodes). Don't recall the command off hand though.
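Both numbers can be checked from Python if that helps (a sketch; /var/ebs is the mount from the df output above):

import os
import resource

# Inode usage on the volume (df -i reports the same thing).
stats = os.statvfs("/var/ebs")
print("inodes used: {} of {}".format(stats.f_files - stats.f_ffree, stats.f_files))

# Per-process open file descriptor limits (ulimit -n shows the soft limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("fd limit (soft/hard): {} / {}".format(soft, hard))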
The strategy we deployed in #2230 worked for all the samples I tried, so I think we should be ready to re-process all samples where:
v0.13.1
I couldn't find a single query to return all of those samples, so I'm running a script to calculate all of them.
Just queued all these samples for re-processing; here are their accession codes:
Out of the 30603 samples we added for re-processing, salmon has succeeded in 195 of them. Looks like there are still some in which the downloader jobs haven't been executed yet (ERR578398). I'll wait until all of them finish to calculate how many samples we got.
Looking at the logs we are getting the following new error:
OSError: [Errno 30] Read-only file system: '/home/user/data_store/ERP006862'
Edit: Created https://github.com/AlexsLemonade/refinebio/issues/2237 to track this error
We deployed https://github.com/AlexsLemonade/refinebio/pull/2243 to work around the problems with the read-only disk from #2237, and currently jobs are running.
https://github.com/AlexsLemonade/refinebio/pull/2243 needs to be reverted once these jobs finish so that the instances can use the EBS volumes again.
I don't know that we should revert. With our current downloading code, it seems we reliably kill EBS. Switching back to something we know to be bad seems to be worse than being able to run without EBS.
Thinking through things here - the huge EBS volumes were really helpful when we needed to download and process a ton of data over a long duration. These smaller, attached SSD volumes are going to be cheaper (we're getting them with the nodes we're already paying for anyway), more performant, and unless we go reprocess the backlog we probably don't need the 10TB volumes.
Yes, that makes sense. In that case, it will need additional changes on top of #2243; right now we are still relying on the number of volumes to register the jobs in Nomad.
Created https://github.com/AlexsLemonade/refinebio/issues/2254 to remove the EBS volumes from the instances.
Also, Nomad started failing yesterday with the error from https://github.com/hashicorp/nomad/issues/5509. All jobs appear to be running, but the allocations show as pending. I'm thinking of re-deploying to see if that fixes the issue.
We won't be able to address this until #2237 is solved. Once that happens, there are still ~16k samples from the initial set where the downloader jobs failed. They can be re-queued with the following steps:
scp -i work/refinebio/infrastructure/data-refinery-key.pem samples.csv ubuntu@34.238.252.255:/tmp/samples.csv
(The script below should be executed inside the foreman.)
from datetime import datetime
from django.db.models import Count, Q
from django.utils import timezone
from data_refinery_common.models import Sample
from data_refinery_foreman.foreman.management.commands.retry_samples import requeue_samples

# Read the accession codes copied over with scp above.
with open('/tmp/samples.csv') as f:
    accession_codes = set(x.strip() for x in f)

# Keep only samples with no processor jobs created since the re-queueing started.
samples = []
for accession in accession_codes:
    recent_processor_jobs = Q(processor_jobs__created_at__gt=datetime(2020, 4, 12, tzinfo=timezone.utc))
    has_processor = (
        Sample.objects.get(accession_code=accession)
        .original_files.annotate(processor=Count('processor_jobs', filter=recent_processor_jobs))
        .filter(processor__gt=0)
        .exists()
    )
    if not has_processor:
        samples.append(accession)

requeue_samples(Sample.objects.filter(accession_code__in=samples))
Context
@arielsvn was looking into a user request in issue https://github.com/AlexsLemonade/refinebio/issues/1996#issuecomment-566190000.
Problem or idea
In at least some cases, we had at least one quant.sf file already processed for a sample. When we upgraded salmon's transcriptome indices, we ended up with quant.sf files that cannot be merged.
Solution or next step
We should query for any cases where we have samples from the same experiment processed with different transcriptome indices. We should reprocess all of these with the latest version and attempt to run tximport over samples processed with the latest version.
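A rough sketch of what that query could look like; both the Experiment import and the relation path "samples__results__organism_index" are guesses at how samples link to the index they were processed with, so they would need to be checked against the actual models before running anything like this:

from django.db.models import Count
from data_refinery_common.models import Experiment

# Hypothetical relation path: experiments whose samples were processed with
# more than one distinct transcriptome index.
mixed = (
    Experiment.objects
    .annotate(num_indices=Count("samples__results__organism_index", distinct=True))
    .filter(num_indices__gt=1)
)
print(mixed.count())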