Build gene-level NAc plink bed files

lcolladotor commented 4 years ago

This task involves https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/master/twas/build_bims.R and https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/master/twas/build_bims.sh.

build_bims.sh

Edit so it'll run for the NAc brain region only, and only at the gene feature level. That is https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.sh#L10 and https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.sh#L14 plus maybe the memory settings lower.

Edit to use R 4.0 at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.sh#L64. This might involve on your JHPCE setup to run something like:

# module load plink
# install_github('gabraham/plink2R/plink2R')
# https://github.com/gusevlab/fusion_twas/issues/14
install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")

Edit the memory use if needed since the NAc data is smaller than the BrainSeq Phase II data.

build_bims.R

Edit to have a test input case for the NAc at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L39-L47

Edit to add a NAc load_rse() function. The BSP2 one is at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L56-L113 and the DentateGyrus is at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L115-L181. If we have degradation data for NAc, we might need to run the qSVA code for NAc then equivalent to https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L115-L181. However, even without qSVA we can get the rest of the code up and running and I can help do this more advanced part.

Edit to load the NAc data at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L193-L203 which is the output from #2.

Ignore the code at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L221-L248 which was done for an earlier test in the BSP2 project.

Remember to subset the expression data as in https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L277 when running the code interactively to test that everything works.

Edit to make the file name consistent at https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L299 with the one from #2.

Once you are happy with the local tests, you might want to fully run the code by executing sh build_bims.sh even if the expression data is not "cleaned" from the qSVA side yet. This will give us an idea of how long its taking and if there are any unexpected issues to iron out. We can then fix the qSVA code and then re-run the script.

aseyedia commented 4 years ago

I simplified build_bims_NAc_genes.sh such that it no longer generated shell scripts and instead directly calls build_bims.R.

Additionally, I made many of the recommended changes including making build_bims.R specific to NAc at the gene feature level.

However, I am not yet ready to run this script because there are still a few problems.

https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L76-L78

In the .RData file I used (which I discovered at /dcl01/lieber/ajaffe/lab/Nicotine/NAc/RNAseq/paired_end_n239/processed_data/rse_gene_NAc_NicotineGrant_hg38_Sept28_2017_n239.Rdata, which I chose on the basis that it seemed to correspond to the rse object that was used in the original script, see below), I was able to extract "RPKM" data like in the original script and assign it to assays(rse)$raw_expr. However, this rse_gene/rse object does not contain a column dedicated to the age of the sample's subject, unlike the original rse that was there when this project was used for hippocampal/DLPFC/Dentate Gyrus tissues.

https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L58-L61

Another issue I has was that I could not find any files in the NAc project directory or any of the subdirectories containing principal components for NAc samples.

https://github.com/LieberInstitute/10xPilot_snRNAseq-human/blob/1e1d2bb6627b06c22decdd642178bd25f9aeb42e/twas/build_bims.R#L80-L106

This gets in the way of load_rse() working as a function. I could try to run it with all of the unavailable parts commented out, but I just don't know how essential they are the the proper execution of this script.

Furthermore, cleaningY on line 105 is a custom function that's part of jaffelab. As a package, jaffelab does not work with conda_R/4.0.x and for some reason cannot be loaded into R/3.6. I used source("../jaffelab/R/cleaningY.R") though we'll probably have to find a more elegant solution.

There might be more, but due to these reasons I don't feel comfortable executing this script just yet. I'll look at the remaining two issues and if I can't make any progress on them without finishing this one, I'll see if I can at least run it at limited capacity.

aseyedia commented 4 years ago

After fixing my copy of R to work with jaffelab and finding the appropriate genotype file (/dcl01/lieber/ajaffe/lab/Nicotine/NAc/RNAseq/paired_end_n239/genotype_data/Nicotine_NAc_Genotypes_n206.rda) for the generation/use of PCs and the model matrix on line 98:

    mod <- model.matrix(~ Dx + Sex + snpPC1 + snpPC2 + snpPC3 + snpPC4 + snpPC5 + pcs,
        data = colData(rse)

I have more or less completed my work on the load_rse() function. I've run the function and left the rse object at /users/aseyedia/NAc_TWAS/rse.Rdata but I can move it somewhere more appropriate/rename it to something more relevant. I'm going to forge ahead and run the rest of the script and write the bim files while changing the input files where appropriate.

aseyedia commented 4 years ago

Finished working on build_bims.R and its accompanying wrapper script, build_bims_NAc_genes.sh. I'm going to test it now and report back with any potential errors.

aseyedia commented 4 years ago

For some reason, I can't get the getopt flags to work. I'm getting the following error message:

Error in storage.mode(peek.optstring) <- mode : invalid value
Calls: getopt ... tryCatch -> tryCatchList -> tryCatchOne -> doTryCatch
Execution halted

I've set default arguments and the wrapper script seems to call R correctly, so I'm not sure exactly what the problem is. If I can't figure it out, I'll test the script interactively.

aseyedia commented 4 years ago

Leo fixed the getopt error for me on Tuesday and we briefly had a moment where the bim input was presenting a chr23 where the rse object only had chrX, chrY, chrM. It was quickly resolved, however, and I'm pleased to report that the script is basically complete. Let me know if you have any questions; I documented my changes in the commit messages but I can expound on them further if you need me to. Briefly:

made the bim and rse objects consistent in their naming scheme for the chromosomes
created a couple of new flags for the build_bims.R, -t test T/F -d degradation T/F
requested more memory and cores
filter bims written to file in the bim_files fxn to unique IDs only (as per plink's request)

LieberInstitute / 10xPilot_snRNAseq-human

Build gene-level NAc plink bed files #3

build_bims.sh

build_bims.R