VollmerLab / Genomic-Signatures-of-Disease-Resistance-in-Staghorn-Corals

Code associated with manuscript analyzing the genomic basis of disease resistance in staghorn corals

Reproducibility issues #2

Closed paigeduffin closed 4 months ago

paigeduffin commented 5 months ago

Good evening,

Firstly, I'd like to thank you for your awesome work, and for making your code and data publicly available. I am eager to work with it but, unfortunately, I have run into several issues that I'm hoping you may be able to assist me with. Any help would be monumentally appreciated!!

I submitted an issue a few days ago with a surface-level question; or, rather, my attempts to work with this pipeline were surface-level at that point.

Since then, I've decided I wanted to try and replicate the entire pipeline from start to finish. I'm at the first step, and running into some reproducibility issues. I've been able to work through a few of these on my own, but as I am just learning a lot of this, the modifications add up to a lot of confusion. Further, I believe I've run into something I can't work through on my own. Here are some examples of the problems I've encountered in step 1 (preprocess sequences):

Example 1: In order to run

    bash bash_code/preProcess.pipeline \
      variant_calling \
      genome/acerv_genome.fasta \
      PE \
      DNA \
      140 \
      variant_calling/raw_reads/*fastq.gz

the user needs to make their own subdirectories within the working directory: ../../variant_calling, ../../genome (with acerv_genome.fasta manually downloaded and renamed), and ../../variant_calling/raw_reads. This requirement is not clearly stated.
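For anyone else hitting this, the layout described above could be created with something like the following minimal sketch. `PROJECT_ROOT` is a hypothetical variable introduced here for illustration; the directory and file names are taken from the command above, and the genome FASTA still has to be downloaded and renamed by hand.

```shell
#!/usr/bin/env bash
set -euo pipefail

# PROJECT_ROOT is an assumption: point it at the directory that should hold
# variant_calling/ and genome/ (defaults to the current directory).
PROJECT_ROOT="${PROJECT_ROOT:-.}"

# Create the directories that the preProcess.pipeline call expects.
mkdir -p "${PROJECT_ROOT}/variant_calling/raw_reads" \
         "${PROJECT_ROOT}/genome"

# Remind the user about the one file that this sketch does not fetch.
if [ ! -f "${PROJECT_ROOT}/genome/acerv_genome.fasta" ]; then
  echo "Still needed: ${PROJECT_ROOT}/genome/acerv_genome.fasta" >&2
fi
```

Because `mkdir -p` is idempotent, this can be re-run safely on a partially set-up working directory.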

Additionally, downloading raw reads using the BioProject number provided is not trivial, as the datasets download command does not work for SRA accessions the way it does for genome accessions. Further, the downloaded files are in SRA format and need to be converted from .sra to .fastq, gzipped, and moved to the variant_calling/raw_reads folder.
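The download, convert, and gzip steps can be sketched roughly as below. This is an illustration under assumptions, not the repository's method: it assumes the `prefetch` and `fasterq-dump` commands from sra-tools are installed, that the reads are paired-end, and that `RUN_LIST` (a hypothetical file, one SRA run accession per line) has been prepared separately. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumptions (not from the repository): sra-tools is installed, and RUN_LIST
# is a hypothetical text file with one SRA run accession per line.
RUN_LIST="${RUN_LIST:-sra_runs.txt}"
OUT_DIR="${OUT_DIR:-variant_calling/raw_reads}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Print a command in dry-run mode, otherwise run it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download one run, convert .sra to paired FASTQ files, and gzip the result.
fetch_run() {
  local acc="$1"
  run prefetch "${acc}"
  run fasterq-dump --split-files --outdir "${OUT_DIR}" "${acc}"
  run gzip "${OUT_DIR}/${acc}_1.fastq" "${OUT_DIR}/${acc}_2.fastq"
}

if [ -f "${RUN_LIST}" ]; then
  while read -r acc; do
    if [ -n "${acc}" ]; then
      fetch_run "${acc}"
    fi
  done < "${RUN_LIST}"
else
  echo "No run list found at ${RUN_LIST}; nothing to download." >&2
fi
```

Setting `DRY_RUN=0` executes the printed commands; writing directly into variant_calling/raw_reads avoids a separate move step.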

Example 2: This is perhaps unavoidable (I do not know), but I'm finding it very difficult to identify all of the modules and R packages I need to have installed on my machine to make the code work.
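A quick way to see which command-line tools are missing is a small PATH check like the sketch below. `check_tools` is a hypothetical helper, and the tool names passed to it are placeholders drawn only from this thread; they are not the pipeline's actual requirements.

```shell
#!/usr/bin/env bash

# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent tool; returns 1 if any are absent.
check_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing: ${tool}"
      status=1
    fi
  done
  return "${status}"
}

# Placeholder list: only names that appear in this thread, not the full
# set of software the pipeline needs.
check_tools bash Rscript singularity datasets \
  || echo "Install the tools listed above before running the pipeline."
```

This only covers executables; R packages would need a separate check from within R.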

Example 3: I've only looked in detail at two Bash script files as of writing this issue: (1) "preProcess.pipeline" and (2) "runRscript.slurm". I'm attempting to sort out what "runRscript.slurm" accomplishes and am having difficulty with the lack of annotations and the extra commented-out lines. However, what ultimately led me to believe I needed to write this issue was that (I think) I do not have access to a file that needs to be at a specific path for the line RSTUDIO_IMAGE="/shared/container_repository/rstudio/rocker-geospatial-4.2.1.sif" to run.

Based on my experience so far, I've decided to halt my attempt to replicate this pipeline until I've had a chance to speak with you about it because, assuming my interpretation has been correct so far, there is a lot more one needs in order to make these scripts (and the many additional scripts I've yet to look through in detail) work such that the pipeline is truly reproducible.

I apologize if I've been curt in this message; I am more frustrated by my own inability to resolve these issues on my own. I do greatly appreciate any help you're able to provide, and I thank you so much in advance for your time and efforts to make this code reproducible.

Best,

Paige Duffin

jdselwyn commented 4 months ago

Hi Paige,

Apologies for the delay in replying; I wanted to make sure I had run through the code entirely from scratch as a new user. I've added a script (initialize_dir.sh) which creates the required directory structure after you clone the repository and downloads the relevant genome, annotation, and SRA files. I have also included a list of all the software and R packages required for the analysis. Given the variety of ways to set up a system and install the software, I've left it to the user to go through the scripts and make sure the program calls will work with how you have things installed on your system (with examples of the three main ways I have used).

Specifically, the line RSTUDIO_IMAGE="/shared/container_repository/rstudio/rocker-geospatial-4.2.1.sif" refers to a Singularity image containing R along with the various packages. That is an example where you need to change the code to target how R is installed on your particular machine.
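One possible way to make that change without hard-coding a path is sketched below, purely as an illustration. It keeps the original path as a default, lets the user override `RSTUDIO_IMAGE` from the environment, and falls back to a locally installed `Rscript` when the image file is absent. `r_command` is a hypothetical helper and is not part of the repository.

```shell
#!/usr/bin/env bash

# Default is the path from runRscript.slurm; override it by exporting
# RSTUDIO_IMAGE before running the script.
RSTUDIO_IMAGE="${RSTUDIO_IMAGE:-/shared/container_repository/rstudio/rocker-geospatial-4.2.1.sif}"

# Print the command prefix to use for Rscript: inside the container when the
# image file exists, otherwise whatever Rscript is found on PATH.
# (r_command is a hypothetical helper, not part of the repository.)
r_command() {
  if [ -f "${RSTUDIO_IMAGE}" ]; then
    echo "singularity exec ${RSTUDIO_IMAGE} Rscript"
  else
    echo "Rscript"
  fi
}

echo "R will be run as: $(r_command)"
```

With this pattern, a user on a different cluster could run `RSTUDIO_IMAGE=/path/to/their/image.sif` ahead of the script rather than editing it; the fallback assumes the required R packages are installed in the local R library.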

Please let me know if you run into any difficulties running the code now. I'll leave the issue open for a week then close it. Feel free to open another one if you encounter problems after I've closed the issue.

Jason