NESCent / popgenInfo

Vignettes for Population Genetics in R
http://popgen.nescent.org
MIT License

New RDA vignette #203

Closed BrennaF closed 6 years ago

BrennaF commented 6 years ago

Hi @zkamvar! I've got a new vignette for the site. I think I uploaded everything correctly. Let me know if I messed something up! @smanel is familiar with this vignette and would be a good reviewer. Thanks! Brenna

zkamvar commented 6 years ago

Hi @BrennaF!

It's good to see another vignette from you :) The checks are not passing currently because the website took too long to build (but that shouldn't be too much of a problem).

command make html took more than 10 minutes since last output

One question I have: how long does it take to run the RDA? In my recent experience, it can take a long time (about a half hour) to run an RDA and anova.cca.

BrennaF commented 6 years ago

Yep @zkamvar -- it takes a while (and this is a pretty big data set too). Depends on the computer, but the RDA can take 15-30 minutes... and the anova.cca for significant axes can take multiple hours! It would be nice to run it, though, for the html output so folks can see it, even if they don't run it themselves...

Is it a problem to extend the run time and just let it go until it finishes?

zkamvar commented 6 years ago

I think the computation time does put a bit of a damper on things considering that this will affect the time it takes all future builds to run (@hlapp would be a bit more versed in this). IIRC, reducing the number of loci would speed things up a bit. Do we need to retain all 42k loci, or can you run a PCA and extract the loci with the highest loadings?
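A minimal sketch of the PCA-based subsetting suggested here, in base R. The simulated matrix `gen`, the choice of three principal components, and the cutoff of 50 loci are illustrative assumptions, not values from the vignette; the thread does not say how the data set was ultimately reduced.

```r
set.seed(1)

# Hypothetical stand-in for a genotype matrix:
# 20 individuals x 200 SNPs coded as allele counts (0/1/2)
gen <- matrix(sample(0:2, 20 * 200, replace = TRUE), nrow = 20,
              dimnames = list(NULL, paste0("snp", 1:200)))

# Drop monomorphic loci, which carry no information for the PCA
gen <- gen[, apply(gen, 2, var) > 0, drop = FALSE]

# PCA on the centered genotype matrix; rows of $rotation are per-locus loadings
pca <- prcomp(gen, center = TRUE)

# For each locus, take its largest absolute loading on the first 3 PCs
max_load <- apply(abs(pca$rotation[, 1:3]), 1, max)

# Keep the 50 loci with the highest loadings (assumed cutoff)
keep <- head(names(sort(max_load, decreasing = TRUE)), 50)
gen_small <- gen[, keep, drop = FALSE]

dim(gen_small)
```

Ranking by the maximum absolute loading is only one possible criterion; selecting loci this way biases the subset toward the major axes of genetic structure, which is acceptable for a demonstration data set but would need justification in a real analysis.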

BrennaF commented 6 years ago

Hmmm. How about I comment out the test for significant axes but leave the code so people have it for their own analyses? Then we just have to wait for the RDA to run (which is significantly shorter). A second option would be for me to make a new genotype matrix that only includes a subset of the SNPs. Even zipped the genotype data set is currently larger than you all like...

hlapp commented 6 years ago

Hi @BrennaF - awesome you're making another submission!! I do agree with @zkamvar's concern about the impact on all future re-runs.

> How about I comment out the test for significant axes but leave the code so people have it for their own analyses? Then we just have to wait for the RDA to run (which is significantly shorter).

In general, for vignettes I'd prefer to have things run and show the results, compared to having it commented out. (Commented-out code is kind of contradictory to the concept of these provenly reproducible vignettes.) Are you saying reducing the number of loci would no longer lead to meaningful results?

I think it's important to keep in mind for these vignettes that they are not meant to be scientific investigations, papers, or case studies. I.e., there is no expectation that the size of the datasets in any way permits biologically robust conclusions; instead the idea is that they be only large enough to illustrate the kind of insight one would be able to get from applying the method or workflow described in the vignette.

zkamvar commented 6 years ago

> Hmmm. How about I comment out the test for significant axes but leave the code so people have it for their own analyses? Then we just have to wait for the RDA to run (which is significantly shorter).

I think this is a good option 👍. If you have the results yourself, you could include them in the write-up since it's a valuable part of the demonstration. In fact, you wouldn't have to comment those lines out, you can just set the chunk option eval = FALSE.

As for the data, since it's in dryad, it would be better to download the file to a temporary folder (same way you did with the spatial data):

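# Download the genotype file from Dryad to a temporary file instead of storing it in the repository
tmp <- tempfile(fileext = ".plink")
download.file("https://datadryad.org/bitstream/handle/10255/dryad.94431/nonAdmix_nacanids_94indiv_unrel_noYNP_42Ksnps_wEcotypes.tped?sequence=1", destfile = tmp)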

I'll see what I can do to bump up the build time for the vignettes 😃 (unless @hlapp vehemently opposes 😤).

zkamvar commented 6 years ago

Ack! I should have waited to comment! Sorry for the mixed signals 😅

BrennaF commented 6 years ago

Okay - I'll work on reducing the size of the data set and provide a new version in the next few days! Thanks so much for your input and help!

BrennaF commented 6 years ago

Okay - I reduced the data set and updated the vignette accordingly. The data set is now much smaller and everything should run pretty quickly. Let me know how it goes!

hlapp commented 6 years ago

Yay @BrennaF, all code runs & passes! I've assigned @smanel as one of the two reviewers as per your suggestion.

@zkamvar, do you have a suggestion for who could be the second? Don't hesitate to self-assign if that'd be you 😄

zkamvar commented 6 years ago

I will be traveling for the next week, so I think I may not be able to do this justice. As far as other reviewers go, I think this may be in the wheelhouse of either @DrK-Lo or @aurielfournier. Would either of you be up for reviewing this resource?

The HTML file to review can be downloaded here: 2018-03-18_RDA_GEA.html.zip

I built it via docker community edition (my install experience was MUCH easier this time around).

# clone BrennaF's github repo and checkout rda branch
git clone https://github.com/BrennaF/popgenInfo.git # you can also use ssh: git@github.com:BrennaF/popgenInfo.git
cd popgenInfo/
git checkout rda

# pull the docker container
docker pull hlapp/rpopgen:latest # This will be a little over 2GB uncompressed
make html

# open the RDA vignette
firefox build/2018-03-18_RDA_GEA.html

aurielfournier commented 6 years ago

I appreciate you thinking of me. I don't have the genetics background to give this a good review though.

BrennaF commented 6 years ago

@smanel is taking a look through it now. Can one of my co-authors on the paper be a reviewer? If not, I'll try to think of some other people.

BrennaF commented 6 years ago

Thanks @smanel! I've added those suggestions and will upload a new version of the vignette now. Thanks for your feedback!

BrennaF commented 6 years ago

@hlapp, could one of my coauthors on our accepted manuscript (e.g. @hhwagner1 or @jesserlasky) look through the vignette for the second approval? I'd like to get the vignette posted soon so I can include the URL in the final manuscript. Thanks!

hlapp commented 6 years ago

> @hlapp, could one of my coauthors on our accepted manuscript (e.g. @hhwagner1 or @jesserlasky) look through the vignette for the second approval?

There isn't a strict rule against that, but can they reasonably assert that there is no COI for them in this review, especially given that the timeline is in your shared interest (hence suggesting not to ask for changes that might threaten that timeline)?

> I'd like to get the vignette posted soon so I can include the URL in the final manuscript. Thanks!

Can you say what your timeline is on that?

BrennaF commented 6 years ago

Reasonable enough @hlapp! I haven't received proofs yet, so no huge rush right now. Do I try to find another reviewer? Or do you all do that?

hlapp commented 6 years ago

@BrennaF there's no formal process yet for identifying reviewers 😄 If you know someone, please don't hesitate to suggest them. They can work closely with you; they just need to be able to assert that either they are not in COI on this review, or if they're not sure whether they are sufficiently free from COI, disclose in brief terms what the COI is.

BrennaF commented 6 years ago

Sounds good - I have three ideas but none of them are on github. I'll cc you on the emails if that works?

hlapp commented 6 years ago

> I have three ideas but none of them are on github. I'll cc you on the emails if that works?

Yes. They don't have to be on Github. (Though I don't know why they wouldn't want to be 😄)

BrennaF commented 6 years ago

Hi @hlapp -- I've just uploaded the final version. This version reflects changes made to address comments by both Stephanie and Martin (you received the email with his review?) & should be good to go! Let me know if you have any questions.

This vignette should go under "For SNP data" and a good drop-down title would be: "Detecting multilocus adaptation using Redundancy Analysis"

Thanks for your help with this!

BrennaF commented 6 years ago

Just following up @hlapp -- can we now post to the website?

This vignette should go under "For SNP data" and a good drop-down title would be: "Detecting multilocus adaptation using Redundancy Analysis"

Thanks!

BrennaF commented 6 years ago

@zkamvar can you help? Hilmar must be busy since I haven't heard from him over the past week. I have two reviews - let me know if we can go ahead and post the vignette.

As I noted above, this vignette should go under "For SNP data" and a good drop-down title would be: "Detecting multilocus adaptation using Redundancy Analysis"

Thanks!

zkamvar commented 6 years ago

Hi @BrennaF!

Sorry for the delay; I'm finally getting back on my feet. Lemme rebuild it and upload it here so you can make sure it rendered correctly using the packages on our system (though docker is telling me that it needs to update, so it may be a bit).

Once you approve of that, I'll merge it, add the html to the menubar and it will be live.

zkamvar commented 6 years ago

Hi @BrennaF, sorry it's taking a bit. The document renders well, except for a couple of the citations, which contain special characters (é and ç) that are encoded as Latin-1 rather than UTF-8. I'm going to make the changes and push them up as soon as the tests finish on this end.
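A minimal sketch of this kind of re-encoding in base R, assuming the offending text is available as a character string. The example string is made up, and the thread does not say which tool was actually used for the fix.

```r
# Build a string from raw Latin-1 bytes: "caf" followed by 0xE9 (é in Latin-1)
latin1_txt <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# The lone 0xE9 byte is not a valid UTF-8 sequence
validUTF8(latin1_txt)

# Convert from Latin-1 to UTF-8; é becomes the two bytes 0xC3 0xA9
utf8_txt <- iconv(latin1_txt, from = "latin1", to = "UTF-8")
validUTF8(utf8_txt)
charToRaw(utf8_txt)
```

For a whole file, the same conversion could be applied to the lines returned by `readLines()` before writing them back out, assuming the entire file is Latin-1 rather than a mixture of encodings.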

zkamvar commented 6 years ago

Here is the rendered vignette after I fixed the encoding.

2018-03-27_RDA_GEA.zip

zkamvar commented 6 years ago

@BrennaF Please check the above rendered vignette and let me know if it rendered how you expected it to :)

BrennaF commented 6 years ago

Looks great @zkamvar! Thank you so much! :)

zkamvar commented 6 years ago

Congratulations, @BrennaF, your vignette is now live at http://popgen.nescent.org/2018-03-27_RDA_GEA.html!

BrennaF commented 6 years ago

Woo hoo! Thanks @zkamvar !!! :)

hlapp commented 6 years ago

Hi folks - I've been on vacation and away from home. Tomorrow would have been the earliest I could have got back on top of this. Glad to see @zkamvar was able to help out.

@BrennaF this PR is actually not complete yet because we are missing the second review and your responses. I will try to dig them up but you can post them too.

BrennaF commented 6 years ago

Hi @hlapp -- sorry, didn't realize you were out of town. I cc'd you (at your drycafe email) on the review emails with Martin Laporte (second reviewer). I'll paste them here too along with my replies. Let me know if you have any other questions and thanks so much for your help with getting the vignette up!


Martin Laporte, Mon 3/26, 2:16 PM

OK, really nice vignette! Not a lot to say. See below:

1- Remove the data/ prefix or add the data folder to the download file.
2- Maybe add information about why and when to use individual-based versus population-based matrices.
3- Maybe you could add the randomForest algorithm as an imputation method: Thierry Gosselin (2017). grur: an R package tailored for RADseq data imputations. R package version 0.0.1 https://github.com/thierrygosselin/grur. doi: 10.5281/zenodo.496176.
4- Maybe add information about why precip_coldest_quarter was removed instead of ann_precip, and how we can more objectively select max_temp_warmest_month and min_temps_coldest month among all the variables.
5- It would be great to report the P-value corresponding to each standard deviation cutoff that could be used.
6- About: "Based on the strongest correlations, most SNPs are associated with our two precipitation variables (annual precipitation and precipitation seasonality), with temperature variables second most important (mean diurnal range and annual mean temperature). The other four variables are much less important." --> This is true only if every SNP is equally advantageous for local adaptation, which is a strong assumption. It is possible that a single SNP variant allowing adaptation to a change in one environmental variable is equivalent to 10 SNP variants for a change in another environmental variable.

Martin


Forester, Brenna, Mon 3/26, 4:16 PM

Awesome Martin, thanks so much! You got back to me before I could even get back to you first :) Sorry about that - I've been in the lab all day. I've attached a new version with my responses (detailed below) & cc'd Hilmar so he knows you've reviewed.

1 - The data folder set-up is designed to match up with the github repository (sorry I didn't explain that when I sent it).
2 - This depends on the sampling design - I tried to clarify and provide a population-level example: "In this case, the data are individual-based, and are input as allele counts (i.e. 0/1/2) for each locus for each individual wolf. For population-based data (e.g., frogs sampled at breeding ponds), you can input the genomic data as allele frequencies within demes."
3 - Great idea - I have made this addition to the text and added the citation.
4 - This really depends on prior ecological/biological knowledge and/or specific hypotheses about what is driving selection. I kept the more general of the two precip variables, since I don't know much about wolf biology. I tried to clarify by adding this: "We only have a few strong correlations. Below, find one option for variable reduction. This could be modified based on ecological and/or biological knowledge about the species or specific hypotheses about the environmental drivers of selection:"
5 - Done.
6 - Great point! I have revised that sentence - thank you so much for catching that. "Based on the strongest correlations, most SNPs are associated with our two precipitation variables (annual precipitation and precipitation seasonality), with temperature variables accounting for the second highest number of detections (mean diurnal range and annual mean temperature). The other four variables are related to a smaller number of detections."

Please let me know if you have any feedback on these changes - if not, I'll submit the final (for now) version to Github. Thanks again!!

Brenna
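For reference, point 5 of the review (the P-value corresponding to each standard deviation cutoff) can be sketched in base R as follows. This assumes the SNP loadings are approximately normally distributed and that outliers are identified in both tails; the particular cutoffs listed are illustrative and not necessarily those used in the vignette.

```r
# Candidate cutoffs, in standard deviations from the mean loading (assumed values)
sd_cutoff <- c(2, 2.5, 3, 3.5)

# Two-tailed P-value under a standard normal distribution
p_two_tailed <- 2 * pnorm(-sd_cutoff)

data.frame(sd_cutoff, p_two_tailed = signif(p_two_tailed, 3))
```

A stricter cutoff lowers the expected false positive rate at the cost of missing loci under weaker selection.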


Martin Laporte, Mon 3/26, 6:46 PM

Hi, I am on my cell phone right now and could not see the new version. However, for the second point, it could be an idea to give a little guideline: if you have different geographic coordinates for the majority of your individuals, you should favor the individual-based approach to increase the power of the analysis. The idea is to tell the user that if the number of geographic coordinates equals the number of populations, they should not use the individual-based approach by repeating the same coordinates for all individuals of the same population. Maybe I'm wrong, but I think this could cause a pseudoreplication problem... Martin


Forester, Brenna, Mon 3/26, 9:49 PM

How about something like this: "In this case, the data are individual-based, and are input as allele counts (i.e. 0/1/2) for each locus for each individual wolf. For population-based data, you can input the genomic data as allele frequencies within demes. The distinction between individual and population based analyses may not be straightforward in all cases. A simple guideline would be to use an individual-based framework when you have individual coordinates for most of your samples, and the resolution of your environmental data (if in raster format) would allow for a sampling of environmental conditions across the site/study area." Does that seem helpful/reasonable? Thanks! Brenna


Martin Laporte, Tue 3/27, 4:31 AM

I think it sounds great now!
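The two input formats discussed in this exchange (individual allele counts versus allele frequencies within demes) can be illustrated with a small base R sketch. The toy matrix `geno`, the deme labels, and the assumption of diploid individuals with no missing genotypes are all hypothetical and not taken from the vignette.

```r
# Individual-based input: allele counts (0/1/2) for 4 individuals at 3 SNPs
geno <- matrix(c(0, 1, 2,
                 1, 1, 0,
                 2, 2, 1,
                 0, 0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Hypothetical deme membership for each individual
deme <- c("A", "A", "B", "B")

# Sum allele counts within each deme (one row per deme)
allele_sums <- rowsum(geno, group = deme)

# Number of individuals per deme, in the same row order as allele_sums
n_ind <- table(deme)[rownames(allele_sums)]

# Population-based input: allele frequency = allele count / (2 * n) for diploids
allele_freq <- allele_sums / (2 * as.vector(n_ind))
allele_freq
```

With missing genotypes, the denominator would have to count only the non-missing calls at each locus, which this sketch does not handle.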

hlapp commented 6 years ago

@BrennaF thanks, this is perfect! Looking forward to your next vignette 😊