AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Recover samples w/ mismatched salmon versions / transcriptome indices #2006

Open cgreene opened 4 years ago

cgreene commented 4 years ago

Context

@arielsvn was looking into a user request in issue https://github.com/AlexsLemonade/refinebio/issues/1996#issuecomment-566190000 .

Problem or idea

In at least some cases we already had a quant.sf file processed for a sample. When we upgraded salmon's transcriptome indices, we ended up with quant.sf files in the same experiment that were built against different indices and therefore cannot be merged.

Solution or next step

We should query for any cases where we have samples from the same experiment processed with different transcriptome indices. We should reprocess all of these with the latest version and attempt to run tximport over the samples processed with the latest version.

arielsvn commented 4 years ago

We should have a command for this, `rerun_salmon_old_samples`, but I couldn't find any record of when we ran it. Looks like we have 530 experiments with mismatched salmon versions, still containing ~25k unprocessed samples.

```
>>> eligible_experiments = (
...     Experiment.objects.filter(technology="RNA-SEQ", num_processed_samples=0)
...     .annotate(
...         num_salmon_versions=Count(
...             "samples__results__organism_index__salmon_version",
...             distinct=True,
...             filter=Q(
...                 samples__results__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]
...             ),
...         )
...     )
...     .filter(num_salmon_versions__gt=1)
... )
>>> eligible_experiments.count()
530
>>> eligible_experiments.values('samples__accession_code').distinct().count()
25368
```
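
For reference, a minimal sketch of how that command could presumably be invoked from a Django shell; the command name comes from the note above, and whether it accepts any arguments isn't confirmed here:

```py
# Sketch only: run the existing management command programmatically.
# `rerun_salmon_old_samples` is the command mentioned above; any flags it
# accepts (e.g. to limit the set of experiments) are unknown here.
from django.core.management import call_command

call_command("rerun_salmon_old_samples")
```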

I'm going to wait until #2210 can be deployed to test this command and try to see why it didn't work for those experiments.

arielsvn commented 4 years ago

Retried sample DRR013807 with downloader job id=7097213. The downloader job completed successfully, so we'll need to investigate why it didn't create any processor jobs afterward.
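
A rough sketch of the checks to run from a Django shell to see where the pipeline stopped; the `original_files` and `processor_jobs` relations are used later in this thread, while `downloader_jobs` is an assumed related name:

```py
# Sketch: for each original file of the sample, check whether the downloader
# marked it as downloaded and whether any processor jobs were ever created.
from data_refinery_common.models import Sample

sample = Sample.objects.get(accession_code="DRR013807")
for original_file in sample.original_files.all():
    print(
        original_file.id,
        original_file.is_downloaded,            # set by the downloader job
        original_file.downloader_jobs.count(),  # assumed related name
        original_file.processor_jobs.count(),   # should be > 0 if anything was queued
    )
```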

arielsvn commented 4 years ago

Did some debugging of this, but still haven't found the problem. I manually set `is_downloaded=False` on the OriginalFile associated with that sample, and now the downloader jobs are failing with the error:

```
Was told to redownload file(s) that are already downloaded!
```

That error is raised in https://github.com/AlexsLemonade/refinebio/blob/dev/workers/data_refinery_workers/downloaders/utils.py#L76-L90

`original_file.needs_downloading()` is True, so I don't yet understand why we're hitting that condition.
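
For anyone else digging into this, here's roughly what I was checking from the shell. It's a sketch that assumes `needs_downloading()` compares the database flags with what's actually on disk (the real logic is in the utils.py lines linked above), and `absolute_file_path` is assumed to be the local path the workers use:

```py
# Sketch: compare the DB flags on the OriginalFile with the state on disk.
# SOME_ORIGINAL_FILE_ID is a placeholder for the file's actual id.
import os

from data_refinery_common.models import OriginalFile

original_file = OriginalFile.objects.get(id=SOME_ORIGINAL_FILE_ID)
print("is_downloaded:", original_file.is_downloaded)
print("needs_downloading():", original_file.needs_downloading())
print("exists on disk:", os.path.exists(original_file.absolute_file_path))  # assumed field
```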

arielsvn commented 4 years ago

Just confirmed that #2218 was successful in fixing the downloader jobs; salmon succeeded on sample DRR013807.

I think I also found another issue with the `rerun_salmon_old_samples` command: it looks like it only queues samples where we successfully ran salmon before. So there are still 32 samples, where salmon has never succeeded, that we should be able to requeue.

Examples of these include: DRR062866, DRR062867, DRR062868, DRR062869, DRR062870

```
>>> from data_refinery_common.job_lookup import ProcessorEnum
>>> experiment = Experiment.objects.get(accession_code="DRP001150")
>>> experiment.samples.exclude(computationalresult__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]).count()
32
```

Here are all their accession codes:

```py
>>> ','.join(experiment.samples.exclude(computationalresult__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]).values_list('accession_code', flat=True))
'DRR062866,DRR062867,DRR062868,DRR062869,DRR062870,DRR062872,DRR062873,DRR062875,DRR062876,DRR062877,DRR062880,DRR062882,DRR062883,DRR062884,DRR062886,DRR062887,DRR062888,DRR062889,DRR062890,DRR062891,DRR062893,DRR062895,DRR062896,DRR062897,DRR062903,DRR062907,DRR062908,DRR062909,DRR062910,DRR062911,DRR062914,DRR062915'
>>> ','.join(experiment.samples.filter(computationalresult__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]).values_list('accession_code', flat=True))
'DRR013789,DRR013789,DRR013790,DRR013790,DRR013791,DRR013791,DRR013792,DRR013792,DRR013793,DRR013793,DRR013793,DRR013793,DRR013794,DRR013794,DRR013794,DRR013794,DRR013795,DRR013795,DRR013795,DRR013795,DRR013796,DRR013796,DRR013797,DRR013797,DRR013798,DRR013798,DRR013799,DRR013799,DRR013800,DRR013800,DRR013800,DRR013800,DRR013801,DRR013801,DRR013801,DRR013801,DRR013802,DRR013802,DRR013802,DRR013802,DRR013803,DRR013803,DRR013804,DRR013804,DRR013805,DRR013805,DRR013806,DRR013806,DRR013807,DRR013807,DRR013807,DRR062865,DRR062865,DRR062871,DRR062871,DRR062871,DRR062874,DRR062878,DRR062879,DRR062879,DRR062879,DRR062879,DRR062881,DRR062881,DRR062881,DRR062881,DRR062885,DRR062885,DRR062892,DRR062892,DRR062894,DRR062894,DRR062894,DRR062894,DRR062894,DRR062898,DRR062899,DRR062900,DRR062901,DRR062902,DRR062904,DRR062904,DRR062905,DRR062906,DRR062906,DRR062906,DRR062906,DRR062912,DRR062913,DRR062913,DRR062913,DRR062913'
```

Logs from the retry samples command:

```
2020-04-03 18:54:43,509 i-046e5ec4e2ca751da [volume: -1] data_refinery_foreman.foreman.management.commands.retry_samples INFO: Re-queued 32 samples with accession codes DRR062866,DRR062867,DRR062868,DRR062869,DRR062870,DRR062872,DRR062873,DRR062875,DRR062876,DRR062877,DRR062880,DRR062882,DRR062883,DRR062884,DRR062886,DRR062887,DRR062888,DRR062889,DRR062890,DRR062891,DRR062893,DRR062895,DRR062896,DRR062897,DRR062903,DRR062907,DRR062908,DRR062909,DRR062910,DRR062911,DRR062914,DRR062915.
```

I'm going to try queueing these samples to see whether any of them succeed before trying all the others in the same situation.
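
Roughly how I'm queueing them, reusing the helper from the retry_samples command (a sketch; `requeue_samples` is the function referenced in the log output above, and this runs from the foreman):

```py
# Sketch: requeue the 32 samples in DRP001150 that have never had a successful
# salmon quant result, using the helper from the retry_samples command.
from data_refinery_common.job_lookup import ProcessorEnum
from data_refinery_common.models import Experiment
from data_refinery_foreman.foreman.management.commands.retry_samples import requeue_samples

experiment = Experiment.objects.get(accession_code="DRP001150")
never_succeeded = experiment.samples.exclude(
    computationalresult__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]
)
requeue_samples(never_succeeded)
```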

arielsvn commented 4 years ago

The volume ran out of space, and it looks like we might need to add more RAM for the SRA jobs. For sample DRR062899 all jobs were OOM-killed:

[screenshot: OOM-killed jobs for sample DRR062899]

The associated original file is ~13 GB.
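
To gauge how much RAM these jobs might need, this is roughly how the file sizes can be checked; a sketch that assumes OriginalFile stores the size in a `size_in_bytes` field:

```py
# Sketch: print the size of each original file for the sample.
# size_in_bytes is assumed to be the field holding the downloaded file size.
from data_refinery_common.models import Sample

sample = Sample.objects.get(accession_code="DRR062899")
for original_file in sample.original_files.all():
    size_gib = (original_file.size_in_bytes or 0) / 1024 ** 3
    print(original_file.id, f"{size_gib:.1f} GiB")
```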

I'm going to scale down and continue looking into this on Monday.

arielsvn commented 4 years ago

Just retried sample DRR062899 and salmon ran successfully on it. Upon further investigation, it looks like the problem with these is that we are running out of disk space. There are only 100 GB on the instance, and it got filled by 5 samples:

```
ubuntu@ip:/var/ebs$ du -hs *
50G     DRP001150
3.4G    processor_job_29572380_index
15G     processor_job_29572383
13G     processor_job_29572386
44K     TRANSCRIPTOME_INDEX
4.0K    VOLUME_INDEX
```

@kurtwheeler what would you recommend for processing these samples? We could either increase the size of the disk or reduce the number of jobs that can run at the same time.

arielsvn commented 4 years ago

Looks like #2222 and #2223 made things better, but there are still some SRA jobs failing, either OOM-killed or with the error "[Errno 12] Cannot allocate memory".

Sample DRR062880 failed with these errors, but then I retried it and it succeeded on the first downloader job:

[screenshot: failed and retried downloader jobs for sample DRR062880]

arielsvn commented 4 years ago

Still looking into this, but I'm running out of ideas. These jobs sometimes work and sometimes they don't. When they fail, we get the following error in the logs:

```
data_refinery_workers.downloaders.sra ERROR [downloader_job: 7098044]: Exception caught while downloading file.
Traceback (most recent call last):
  File "/home/user/data_refinery_workers/downloaders/sra.py", line 108, in _download_file_http
    shutil.copyfileobj(r.raw, f)
  File "/usr/lib/python3.5/shutil.py", line 76, in copyfileobj
    fdst.write(buf)
OSError: [Errno 12] Cannot allocate memory
```

We have faced similar issues before: https://github.com/AlexsLemonade/refinebio/issues/572 and https://github.com/AlexsLemonade/refinebio/pull/1437 (these issues have started to appear in my Google searches).
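
For context, one generic mitigation for memory pressure while streaming large HTTP responses is to copy the body to disk in bounded chunks instead of handing requests' raw stream to shutil.copyfileobj. This is a sketch of the general pattern, not the actual downloader code; the URL and target path are placeholders:

```py
# Generic sketch: stream an HTTP download to disk in fixed-size chunks so the
# process never buffers more than CHUNK_SIZE bytes of the body at a time.
import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB


def download_file_chunked(url: str, target_path: str) -> None:
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(target_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
```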

cgreene commented 4 years ago

Either RAM or disk. I'm guessing disk. Is there available space?


arielsvn commented 4 years ago

Yes, we have space available:

```
$ df -h | grep /var/ebs
/dev/xvdf       2.0T  392G  1.5T  21% /var/ebs
```

We also found this (https://stackoverflow.com/a/45620524/763705), which suggests that hitting the max file descriptor limit might cause that error as well.
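
A quick way to rule that out from a Python shell on the worker; a sketch that checks the process's file descriptor limit and, while we're at it, the free inode count on /var/ebs (the mount from the df output above):

```py
# Sketch: check limits that don't show up in `df -h` but can still make
# downloads fail even when the disk has plenty of space.
import os
import resource

# Open file descriptor limit for this process (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("fd limit (soft/hard):", soft, hard)

# Descriptors currently open by this process (Linux-only).
print("open fds:", len(os.listdir("/proc/self/fd")))

# Free inodes on the data volume.
stats = os.statvfs("/var/ebs")
print("free inodes:", stats.f_ffree, "of", stats.f_files)
```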

cgreene commented 4 years ago

Ahh, I've hit that before with many small files. It seems unusual in this case, but it's possible to check that number (I think it's the number of inodes). I don't recall the command offhand though.


arielsvn commented 4 years ago

The strategy we deployed in #2230 worked for all the samples I tried, so I think we should be ready to re-process all samples where:

- the sample belongs to an RNA-seq experiment with no processed samples and more than one salmon version among its quant results, and
- the sample doesn't already have a result from `salmon 0.13.1`.

I couldn't find a single query to return all of those samples, so I'm running a script to calculate all of them.

```py
import time

from data_refinery_common.utils import queryset_page_iterator
from django.db.models.expressions import Q
from django.db.models import Count
from data_refinery_common.job_lookup import ProcessorEnum
from data_refinery_common.models import Experiment

eligible_experiments = (
    Experiment.objects.filter(technology="RNA-SEQ", num_processed_samples=0)
    .annotate(
        num_salmon_versions=Count(
            "samples__results__organism_index__salmon_version",
            distinct=True,
            filter=Q(
                samples__results__processor__name=ProcessorEnum.SALMON_QUANT.value["name"]
            ),
        )
    )
    .filter(num_salmon_versions__gt=1)
)

result = {}
skipped_experiments = []

# Step 1: Extract the samples that we want to requeue from the eligible_experiments
# This times out for experiments with more than 1344 samples, so for those we
# will calculate this differently
for page in queryset_page_iterator(eligible_experiments, 50):
    time.sleep(1)
    for experiment in page:
        if experiment.num_total_samples > 1344:
            skipped_experiments.append(experiment.accession_code)
            continue
        samples = experiment.samples.exclude(results__organism_index__salmon_version='salmon 0.13.1')
        result[experiment.accession_code] = list(samples.values_list('accession_code', flat=True))

# Step 2: Calculate the samples we'll want to include for the skipped experiments
# skipped_experiments = ['ERP006662','ERP017126','SRP045452','SRP025982','SRP068957','SRP067759','SRP056902','ERP014302','ERP004375']
for experiment_accession_code in skipped_experiments:
    experiment = Experiment.objects.get(accession_code=experiment_accession_code)
    samples = []
    for sample in experiment.samples.all():
        if sample.results.filter(organism_index__salmon_version='salmon 0.13.1').exists():
            continue
        samples.append(sample.accession_code)
    result[experiment_accession_code] = samples

# Step 3: Save the samples and experiment accession codes into a file
with open('/tmp/samples.csv', mode='w') as f:
    samples = set(sample_accession_code for sample_list in result.values() for sample_accession_code in sample_list)
    f.write('\n'.join(samples))

with open('/tmp/experiments.csv', mode='w') as f:
    f.write('\n'.join(result.keys()))
```

Then copy the generated files to a local computer with:

```bash
scp -i work/refinebio/infrastructure/data-refinery-key.pem ubuntu@3.84.62.246:/tmp/samples.csv samples.csv
scp -i work/refinebio/infrastructure/data-refinery-key.pem ubuntu@3.84.62.246:/tmp/experiments.csv experiments.csv
```
arielsvn commented 4 years ago

Just queued all of these samples for re-processing; here are their accession codes:

samples.csv.txt

experiments.csv.txt

```py
from data_refinery_foreman.foreman.management.commands.retry_samples import requeue_samples

samples = set(sample_accession_code for sample_list in result.values() for sample_accession_code in sample_list)
requeue_samples(Sample.objects.filter(accession_code__in=samples))
```

```py
>>> from data_refinery_common.models import Sample
>>> Sample.objects.filter(is_processed=True).count()
960681
```
arielsvn commented 4 years ago

Out of the 30603 samples we queued for re-processing, salmon has succeeded on 195 of them so far. It looks like there are still some for which the downloader jobs haven't been executed yet (e.g. ERR578398).

I'll wait until all of them finish to calculate how many samples we got.

```py
from datetime import datetime, timedelta
from django.db.models.expressions import Q
from django.db.models import Count
from data_refinery_common.models import Sample
from django.utils import timezone

with open('/tmp/samples.csv') as f:
    accession_codes = set(x.strip() for x in f)

# total samples added for re-processing
len(accession_codes)

# Calculate number of samples where salmon succeeded
Sample.objects.filter(accession_code__in=accession_codes, results__organism_index__salmon_version='salmon 0.13.1').count()

# calculate number of samples with no processor jobs
samples = []
for accession in accession_codes:
    has_processor = Sample.objects.get(accession_code=accession).original_files.annotate(processor=Count('processor_jobs', filter=Q(processor_jobs__created_at__gt=datetime(2020, 4, 12, tzinfo=timezone.utc)))).filter(processor__gt=0).exists()
    if not has_processor:
        samples.append(accession)

print(len(samples))
```
arielsvn commented 4 years ago

Looking at the logs, we are getting the following new error:

```
OSError: [Errno 30] Read-only file system: '/home/user/data_store/ERP006862'
```

Edit: Created https://github.com/AlexsLemonade/refinebio/issues/2237 to track this error
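
A quick way to confirm from inside a worker that the mount itself has flipped to read-only (a sketch; the data_store path comes from the error above):

```py
# Sketch: check whether the filesystem backing the data_store path is mounted
# read-only, which is what Errno 30 (EROFS) means on Linux.
import os

path = "/home/user/data_store"
stats = os.statvfs(path)
print("mounted read-only:", bool(stats.f_flag & os.ST_RDONLY))
print("writable according to os.access:", os.access(path, os.W_OK))
```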

arielsvn commented 4 years ago

We deployed https://github.com/AlexsLemonade/refinebio/pull/2243 to work around the read-only disk problems from #2237, and jobs are currently running.

https://github.com/AlexsLemonade/refinebio/pull/2243 needs to be reverted once these jobs finish so that the instances can use the EBS volumes again.

cgreene commented 4 years ago

I don't know that we should revert. With our current downloading code, it seems we reliably kill EBS. Switching back to something we know to be bad seems to be worse than being able to run without EBS.

cgreene commented 4 years ago

Thinking through things here - the huge EBS volumes were really helpful when we needed to download and process a ton of data over a long duration. These smaller, attached SSD volumes are going to be cheaper (we're getting them with the nodes we're already paying for anyway), more performant, and unless we go reprocess the backlog we probably don't need the 10TB volumes.

arielsvn commented 4 years ago

Yes, that makes sense. In that case it will need additional changes on top of #2243; right now we are still relying on the number of volumes to register the jobs in Nomad.

arielsvn commented 4 years ago

Created https://github.com/AlexsLemonade/refinebio/issues/2254 to remove the EBS volumes from the instances.

Also, Nomad started failing yesterday with the error from https://github.com/hashicorp/nomad/issues/5509. All jobs appear to be running, but the allocations show as pending. I'm thinking of re-deploying to see if that fixes the issue.

```
ubuntu@ip-10-0-171-201:~$ nomad status SALMON_0_16384/dispatch-1587571332-9b1d8a61
ID            = SALMON_0_16384/dispatch-1587571332-9b1d8a61
Name          = SALMON_0_16384/dispatch-1587571332-9b1d8a61
Submit Date   = 2020-04-22T16:02:12Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
jobs        0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
8e33f638  9090fa59  jobs        0        run      pending  1m1s ago  1m1s ago

ubuntu@ip-10-0-171-201:~$ nomad status 8e33f638
ID                  = 8e33f638
Eval ID             = 0e342521
Name                = SALMON_0_16384/dispatch-1587571332-9b1d8a61.jobs[0]
Node ID             = 9090fa59
Job ID              = SALMON_0_16384/dispatch-1587571332-9b1d8a61
Job Version         = 0
Client Status       = pending
Client Description  =
Desired Status      = run
Desired Description =
Created             = 1m8s ago
Modified            = 1m8s ago

Couldn't retrieve stats: Unexpected response code: 404 (Unknown allocation "8e33f638-f7a1-49ee-243b-28344d1e3e21")
```
arielsvn commented 4 years ago

We won't be able to address this until #2237 is solved. Once that happens, there are still ~16k samples from the initial set where the downloader jobs failed. They can be re-queued with the following steps:

1. Copy the file with the sample accession codes (samples.csv.txt) into the instance:

```bash
scp -i work/refinebio/infrastructure/data-refinery-key.pem samples.csv ubuntu@34.238.252.255:/tmp/samples.csv
```

2. Calculate the accession codes of the samples that we haven't tried to process since we started with this set

(This should be executed inside the foreman)

```py
from datetime import datetime, timedelta
from django.db.models.expressions import Q
from django.db.models import Count
from data_refinery_common.models import Sample
from django.utils import timezone

with open('/tmp/samples.csv') as f:
    accession_codes = set(x.strip() for x in f)

samples = []
for accession in accession_codes:
    has_processor = Sample.objects.get(accession_code=accession).original_files.annotate(processor=Count('processor_jobs', filter=Q(processor_jobs__created_at__gt=datetime(2020, 4, 12, tzinfo=timezone.utc)))).filter(processor__gt=0).exists()
    if not has_processor:
        samples.append(accession)
```

3. Re-queue all these samples:

```py
from data_refinery_foreman.foreman.management.commands.retry_samples import requeue_samples

requeue_samples(Sample.objects.filter(accession_code__in=samples))
```