Closed mr-c closed 2 years ago
We could possibly use a different chromosome to generate the STAR index, which could result in a smaller dataset. The fastq files can be reduced to only one file.
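As a rough sketch of what generating the index for a single chromosome involves (the file names here are placeholders for illustration, not the tutorial's actual inputs), STAR's `genomeGenerate` mode looks something like:

```shell
# Build a STAR index from a single-chromosome reference.
# chr19.fa and annotation.gtf are placeholder names, not the tutorial's files.
STAR --runMode genomeGenerate \
     --genomeDir STAR_index \
     --genomeFastaFiles chr19.fa \
     --sjdbGTFfile annotation.gtf \
     --sjdbOverhang 100 \
     --genomeSAindexNbases 11
```

For small references the STAR manual recommends lowering `--genomeSAindexNbases` (to roughly `min(14, log2(GenomeLength)/2 - 1)`), which also reduces memory use during indexing.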
I think this would be very useful. My laptop crashed trying to create the star index files so I downloaded them instead, but this took quite a while.
We are running on chromosome 19 -- so I believe the difference between 19 and 21 would not solve this issue. I would recommend instead adding a comment to the text itself about how much memory is needed to run the index command on a laptop. Unfortunately, real-world examples take real-world-sized data.
> We are running on chromosome 19 -- so I believe the difference between 19 and 21 would not solve this issue.
Can we use half of the file?
Um, not sure that would make sense or give us a real result; I would have to play around with the existing data. I believe the initial problems were the indexing itself and the time to download the index file. We could simply note: (1) that if you have less than X amount of memory/RAM you should download the pre-built index instead, and (2) the download time -- perhaps asking students to download the data before the session, or noting that they will have to wait. I am not sure of the download time, but we could give an estimate.
As a note, the alignment process requires 9 GB of RAM:

```yaml
ResourceRequirement:
  ramMin: 9000
```
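For context, here is a hedged sketch of where that requirement sits in a CWL tool description (the tool fields shown are illustrative, not the tutorial's exact file). Per the CWL spec, `ramMin` is expressed in mebibytes:

```yaml
# Illustrative fragment of a CommandLineTool -- not the tutorial's actual file.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: STAR
hints:
  ResourceRequirement:
    ramMin: 9000   # minimum RAM in mebibytes (MiB)
```

Note that placing this under `hints` (rather than `requirements`) makes it advisory; runners like `cwltool` only enforce it as a hard limit with an extra flag.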
It looks like the index creation takes ~8 GB of RAM. We profiled it here: https://arvados.org/2021/12/07/debugging-cwl-in-arvados/
This example used a soybean data set -- https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/rna-seq-tutorial-with-reference-genome/# ---> it might be easier to use that, and adjust the reference, than to trim the human data.
The sizes might be similar though -- https://data.jgi.doe.gov/refine-download/phytozome?q=Gmax_275&expanded=Phytozome-275
Wikipedia (that ever reliable source of information...) tells me that the soybean genome is still ~1/3 of the size of the human genome. Would it be possible to use something smaller still - nematode or fruit fly for example? Or would these not be suitable / available for our purposes here?
https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
We just have to find a resource that has it -- and in the right format. I can try to find a fruit fly example -- however, we want to balance having a small "toy" example against something that does a bit of crunching, to show that CWL is useful.
Galaxy has a fruitfly example (we need to give a ref to the data, as they did -- and perhaps a call-out to them): https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html
I've started some new material based on the fruitfly example here: https://github.com/douglowe/cwl-tutorial-material-prototype
I'm trying to map their examples to the CWL tools -- slowly getting there, but if anyone with more familiarity with the science wants to advise on this, that would be great!
I've removed the previous data download instructions https://github.com/carpentries-incubator/cwl-novice-tutorial/blob/071eed284a2ce79fc660ee1d061838bc54bbd1d1/setup.md#files
Reopening this issue - to document the resources needed for generating the (reduced) STAR index.
I have now managed to generate the index on my laptop (MacBook Pro, 2.3 GHz Dual-Core Intel Core i5, 16 GB RAM). The Docker settings I used for this were: CPUs (2); Memory (10 GB); Swap (3 GB); Disk image size (59.6 GB / 4.7 GB used). And I added these lines to the `hints` section of `STAR-Index.cwl`:

```yaml
ResourceRequirement:
  coresMin: 4
```
Building the index took 9 minutes, and the maximum memory usage was reported at 9288 MiB. The longest step is:

```
... sorting Suffix Array chunks and saving them to disk...
```

which takes ~5-7 minutes.
Previously this had not worked for me (the task froze, and I killed it after 9 hours). The changes I made before this worked were: (1) deleted & reinstalled Docker; (2) added the `ResourceRequirement` lines to `hints` in `STAR-Index.cwl`; (3) increased Memory from 6 GB to 10 GB; (4) increased Swap from 1 GB to 3 GB. I'll experiment with 2-4 to see if I can find the key change.
(2) is not necessary - the indexing works fine using a single CPU (with no noticeable change to the process) Max memory used: 9640MiB
(4) is not necessary - indexing is okay with a Swap of 1Gb. Run time was 11 minutes (slightly slower). Max memory used: 9681MiB
(3) is the problem - the minimum Memory allocation required for docker (for my system, at least) is 9Gb, using the Resources menu in docker desktop.
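If it helps others check their own setup: Docker can report the memory ceiling allocated to the daemon (the Docker Desktop VM on macOS), so the Resources setting can be verified from the command line, assuming Docker is installed and running:

```shell
# Show total memory (in bytes) available to the Docker daemon.
docker info --format '{{.MemTotal}}'
```

On Docker Desktop this reflects the VM's Memory allocation, so a value well under ~9 GB would predict the freeze described above.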
I can't seem to get Docker to enforce a minimum RAM value. Using the `ramMin` resource requirement together with the `--strict-memory-limit` flag for `cwltool` does cause the `--memory` flag to be used in the docker call, but I've not found a value large enough to cause docker to refuse to run. So I think we'll just have to add a warning to the setup page about the minimum RAM needed, and warn users what to expect if their job is going to fail.
400MiB is doable, but smaller is better