Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

Request BAM files to remain in Scout for 6 months #1000

Closed NicoleCMMS closed 6 years ago

NicoleCMMS commented 6 years ago

Last week, at a GMCK Scout users meeting with representatives from clinical genetics, immunology and CMMS, we discussed how long the BAM files are available in Scout. All three clinics agree that 3 months is too short. Ideally we would like to have access to them until the case is solved/archived, but we understand that there is a lack of data capacity for this. Would it be possible to extend the availability of these files to 6 months instead of 3?

moonso commented 6 years ago

Number 1000 🍰 🍾 👾 Congratulations!!!

henrikstranneheim commented 6 years ago

It is not possible in our current cluster. However, the installation of the new cluster with more data storage is progressing nicely so far. We could probably extend the availability once we have migrated there. I do not dare say when we will have completed the migration, but we are talking months here. We are also looking into software that would help us reduce the storage footprint of BAMs, but nothing has been decided yet.

NicoleCMMS commented 6 years ago

Excellent, good to hear it is on the wish list for when the problem with data storage is solved. Måns, it's always good to celebrate... even the small things in life 😉

dnil commented 6 years ago

Here is some food for thought regarding fully implementing user-initiated deletion. Most cases that do get (marked) solved are indeed solved within a shorter time frame than three months:

[Image: screenshot 2018-10-11 at 16 02 07]

The same picture can be seen for archiving, though the tail is even longer here:

[Image: screenshot 2018-10-11 at 16 31 44]

(The latter is slightly approximated upwards as the activity log does not explicitly state "archive". Instead I used the time to the latest status change on each archived case.)

We could have saved some 750 case storage months by deleting these as soon as they were finished, but as the total 3 month regime for all cases in the current instance gives some 8388 case storage months, the saving is about 8%. Not enough for another month of storage for remaining cases.
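A rough sketch of the arithmetic above, in Python. It assumes every case occupies storage for the full three months, so the case count is implied by the totals, not measured. The function names are made up for illustration, and the inputs are the approximate figures quoted above.

```python
# Rough check of the storage arithmetic, using the approximate figures quoted above.
SAVED_CASE_MONTHS = 750      # "some 750 case storage months"
TOTAL_CASE_MONTHS = 8388     # "some 8388 case storage months"
RETENTION_MONTHS = 3         # current shelf life of BAM files


def saving_fraction(saved: float, total: float) -> float:
    """Share of total storage that early deletion would have freed."""
    return saved / total


def implied_case_count(total: float, retention_months: float) -> float:
    """Assumption: every case is stored for the full retention period."""
    return total / retention_months


def affordable_extra_months(saved: float, total: float, retention_months: float) -> float:
    """Extra months per case the saving could fund, if spread over all cases."""
    return saved / implied_case_count(total, retention_months)


if __name__ == "__main__":
    print(f"saving: {saving_fraction(SAVED_CASE_MONTHS, TOTAL_CASE_MONTHS):.1%}")
    print(f"implied cases: {implied_case_count(TOTAL_CASE_MONTHS, RETENTION_MONTHS):.0f}")
    extra = affordable_extra_months(SAVED_CASE_MONTHS, TOTAL_CASE_MONTHS, RETENTION_MONTHS)
    print(f"extra months per case: {extra:.2f}")
```

Under these assumptions the saving would fund roughly a quarter of an extra month per case, which is why it falls short of a full extra month.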

moonso commented 6 years ago

Well done @dnil ! This might be something for @ingkebil to see to?

henrikstranneheim commented 6 years ago

Nice work Daniel! At least now we know that we should focus on the other options.

ingkebil commented 6 years ago

As Henrik says: in our setting this is near to impossible.

Some context on the shelf life of files on our main storage unit:

For now, we can reduce the amount of time we keep the fastq files and increase the amount of time we keep the BAM files, e.g. take 30 days from the fastq files and add them to the BAM files.

What I am wondering: how many of the rerun requests are for bam file generation?

NicoleCMMS commented 6 years ago

For us at CMMS the majority of the reruns are ordered to enable analysis against an updated version of our databases. A small number are rerun in order to get the bam files back.

ingkebil commented 6 years ago

I find it reasonable to hang on to the BAM files for as long as we can and favour deleting fastq files over BAM.

What if we change the shelf life as follows:

fastq: 30 days
BAM: 170 days

@emiliaol @vwirta any objections?

vwirta commented 6 years ago

This would make reruns from fastq more challenging. How easy would it be to go from BAM to fastq? Or start MIP from BAM?

Did you have a more in-depth look at Petagene? Could this be part of the solution?

ingkebil commented 6 years ago

This would only be a solution for rasta. I don't think we need to hold ourselves to these strict shelf lives on hasta. Yet.

I don't know how often we rerun a case in the first 100 days. Querying scout might give us an answer to that. Or querying EOL ;)

vwirta commented 6 years ago

Ok, for rasta this sounds like a good quick fix. Let’s query EOL to get her opinion :)

hassanfa commented 6 years ago

IMHO, making an unaligned bam file might be an even better solution. Going from an unaligned bam file to an aligned bam file is fairly straightforward.

ingkebil commented 6 years ago

I don't know if that is relevant for Nicole's question? The current focus is only on the shelf life of BAM files.

If it is relevant, can we take that up in another thread?

hassanfa commented 6 years ago

You're right, another thread. But we should remember that aligners are not perfect. If fastq files are removed and only bam files are kept, then possible analysis reruns might be tricky. If a decision is made to keep bam files, a complete bam file which includes unaligned reads has to be an option.

northwestwitch commented 6 years ago

What about converting bam to cram files? The format is supported by IGV and it would save us a lot of space. Could we start doing it on the new cluster?
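For illustration only, a minimal Python sketch of what such a conversion could look like with `samtools view`, which can write CRAM (`-C`) against a reference FASTA (`-T`). The helper names and file names are hypothetical, and nothing here reflects how Scout or the cluster pipelines are actually set up. Note that reading the CRAM back generally requires access to the same reference.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_cram_command(bam: str, reference: str, cram: Optional[str] = None) -> List[str]:
    """Build a `samtools view` command that writes CRAM (-C) using a reference FASTA (-T)."""
    # Default output name: same path as the BAM, with a .cram suffix (an assumption).
    out = cram if cram is not None else str(Path(bam).with_suffix(".cram"))
    return ["samtools", "view", "-C", "-T", reference, "-o", out, bam]


def convert_to_cram(bam: str, reference: str, cram: Optional[str] = None) -> None:
    """Run the conversion; needs samtools on PATH and raises if samtools fails."""
    subprocess.run(build_cram_command(bam, reference, cram), check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run samtools.
    print(" ".join(build_cram_command("case1.bam", "reference.fa")))
```

Splitting command construction from execution keeps the sketch checkable without samtools installed.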

ingkebil commented 6 years ago

bump @emiliaol :)

ingkebil commented 6 years ago

Shelf life of files has been changed to:

Please be aware that we might have to adjust these numbers to take some pressure off our people who deal with reruns.

Hope that helps.

ingkebil commented 6 years ago

Hi Nicole,

shelf life has now been changed to:

Please be aware we might have to adjust these numbers depending on the amount of reruns requested.

Hope this helps, Kenny

NicoleCMMS commented 6 years ago

Excellent, thanks!

Nicole
