Shrink dataset - Githubissues

mr-c commented 2 years ago

400MiB is doable, but smaller is better

ALuesink commented 2 years ago

We could possibly use a different chromosome to generate the STAR index, which could result in a smaller dataset. The fastq files can be reduced to only one file.

gcapes commented 2 years ago

I think this would be very useful. My laptop crashed trying to create the STAR index files, so I downloaded them instead, but this took quite a while.

swzCuroverse commented 2 years ago

We are running on chromosome 19 -- so I believe the difference between 19 and 21 would not solve this issue. I would recommend instead adding a note to the text itself about how much memory is needed to run the index command on a laptop. Unfortunately, real-world examples take real-world-sized data.

mr-c commented 2 years ago

> We are running on chromosome 19 -- so I believe the difference between 19 and 21 would not solve this issue.

Can we use half of the file?

swzCuroverse commented 2 years ago

Um, not sure that would make sense and give us a real result. I would have to play around with the existing data. I believe the initial problem was the indexing and the time to download the index file. We could just note 1) that if you have less than X amount of memory/RAM you should download the index, and 2) the time for the download, and perhaps ask students to download the data before the session or note that they will have to wait. I am not sure of the download time, but we could give an estimate.

As a note, the alignment process requires 9 GB of RAM (ResourceRequirement: ramMin: 9000).
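For reference, a minimal sketch of how that memory requirement might be declared in a CWL tool description. Placing it under `requirements` rather than `hints` is an assumption; only the `ramMin: 9000` value comes from the note above.

```yaml
# Fragment of a CWL CommandLineTool description (not a complete file).
# Assumption: the requirement lives under `requirements`; it could
# equally be listed under `hints`.
requirements:
  ResourceRequirement:
    ramMin: 9000   # minimum reserved RAM; CWL expresses this in mebibytes
```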

It looks like the index creation takes ~8 GB of RAM. We profiled it here: https://arvados.org/2021/12/07/debugging-cwl-in-arvados/

swzCuroverse commented 2 years ago

This example used a soybean data set -- https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/rna-seq-tutorial-with-reference-genome/# ---> it might be easier to use that and adjust the reference than to trim the human data.

The sizes might be similar though -- https://data.jgi.doe.gov/refine-download/phytozome?q=Gmax_275&expanded=Phytozome-275

douglowe commented 2 years ago

Wikipedia (that ever reliable source of information...) tells me that the soybean genome is still ~1/3 of the size of the human genome. Would it be possible to use something smaller still - nematode or fruit fly for example? Or would these not be suitable / available for our purposes here?

https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes

swzCuroverse commented 2 years ago

We just have to find a resource that has it - and has the right format. I can try to find a fruit fly example - however we want to balance having a small "toy" example and something that does a bit of crunch to show that CWL is useful.

swzCuroverse commented 2 years ago

Galaxy has a fruitfly example (we need to give a ref to the data, as they did - and perhaps a call out to them but...) https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

douglowe commented 2 years ago

I've started some new material based on the fruitfly example here: https://github.com/douglowe/cwl-tutorial-material-prototype

I'm trying to map their examples to the CWL tools - slowly getting there, but if anyone who has more familiarity with the science wants to advise on this, that would be great!

mr-c commented 2 years ago

I've removed the previous data download instructions https://github.com/carpentries-incubator/cwl-novice-tutorial/blob/071eed284a2ce79fc660ee1d061838bc54bbd1d1/setup.md#files

douglowe commented 2 years ago

Reopening this issue - to document the resources needed for generating the (reduced) STAR index.

douglowe commented 2 years ago

I have now managed to generate the index on my laptop (MacBook Pro, 2.3 GHz Dual-Core Intel Core i5, 16 GB RAM). The docker settings I used for this were: CPUs (2); Memory (10Gb); Swap (3Gb); Disk image size (59.6Gb / 4.7Gb used). And I added these lines to the hints section of STAR-Index.cwl:

  ResourceRequirement:
    coresMin: 4

Building the index took 9 minutes, and the maximum memory usage was reported at 9288MiB. The longest step is:

... sorting Suffix Array chunks and saving them to disk...

which takes ~5-7 minutes.

Previously this had not worked for me (the task froze, and I killed it after 9 hours). The changes I made before this worked were: (1) deleted & reinstalled docker; (2) added the coresMin hint to STAR-Index.cwl; (3) increased Memory from (6Gb) to (10Gb); (4) increased Swap from (1Gb) to (3Gb). I'll experiment with 2-4 to see if I can find the key change.

douglowe commented 2 years ago

(2) is not necessary - the indexing works fine using a single CPU (with no noticeable change to the process). Max memory used: 9640MiB

douglowe commented 2 years ago

(4) is not necessary - indexing is okay with a Swap of 1Gb. Run time was 11 minutes (slightly slower). Max memory used: 9681MiB

douglowe commented 2 years ago

(3) is the problem - the minimum Memory allocation required for docker (for my system, at least) is 9Gb, using the Resources menu in docker desktop.

I can't seem to get docker to enforce a minimum RAM value. Using the ramMin resource requirement together with the --strict-memory-limit flag for cwltool does cause the --memory flag to be used in the docker call, but I've not found a value large enough to stop docker from running. So I think we'll just have to add a warning to the setup page about the minimum RAM needed, and warn users what to expect if their job is going to fail.
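A minimal sketch of the kind of ramMin hint described above, assuming it is added to the hints section of STAR-Index.cwl. The value 9700 is an assumption, chosen only to sit just above the largest peak reported in this thread (9681MiB); it is not a tested limit.

```yaml
# Fragment for the hints section of STAR-Index.cwl (not a complete file).
# Assumption: 9700 is illustrative, slightly above the 9681MiB peak
# reported earlier in this thread.
hints:
  ResourceRequirement:
    ramMin: 9700   # mebibytes
```

With cwltool, this value is only passed to docker as a --memory limit when the run uses --strict-memory-limit, and as noted above it caps memory rather than guaranteeing that much is available.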

carpentries-incubator / cwl-novice-tutorial

Shrink dataset #51