datadryad / dryad-product-roadmap

Repository of issues for Dryad project boards
https://github.com/orgs/datadryad/projects

Update README template #2726

Closed ahamelers closed 10 months ago

ahamelers commented 11 months ago

The curators are working on edits to the current README template (https://docs.google.com/document/d/1_Lg3xtLFwFbOI697bIuBPPLpQKqYLJ5I7j8UXVlXyIQ/edit)

ahamelers commented 11 months ago

@bryanmgee @jleighherzog @lauradryad

Good notes from Scott on the README pull request

Thank you for doing all this, @ahamelers .

It looks really beautiful and is so much better than our old process and you've put a ton of great work into this.

Filling in the extra README made me wonder how annoyed I would be if I was an author and I needed to re-fill the title (maybe we could pre-fill if no readme exists yet) and also the related works.

I also found myself questioning inclusion of all the related works. I found myself thinking that maybe we should put a link to the DOI or Dryad URL in the readme so someone could get all the rich and up-to-date metadata on the live site in case they lost track of where they got the data files from but I wasn't sure it needed repeating in there.

I also wasn't sure how much some of the stuff overlaps with what they've already filled. Like I can see people just putting in the abstract for "give a brief summary" section (whether that is meant to be that way or not).

I think these are all metadata questions and maybe I'm not the right person to answer them so I might be completely off base. However, as a user if I interpret the readme as filling in information I've already entered and I have to retype it all again, then I'd be annoyed.

bryanmgee commented 11 months ago

My two cents:

ahamelers commented 11 months ago

Usage Notes: this would be useful, but it's used quite variably and sometimes could slot into the Code/Software section of the README and other times would not line up

Based on the UX testing feedback, we're getting rid of the separate Usage notes now that the README will be required (for submissions after this release it will only appear if someone has already put stuff there before the release). The thinking was, based on the instructions for the Usage notes and the README, it is content they are meant to put in the README and is entirely duplicative. Let me know if you disagree—if there is some reason for a separate Usage notes it will need to be really differentiated for our confused users.

bryanmgee commented 11 months ago

No, I like that, we get a lot of garbage in the Usage Notes right now

jleighherzog commented 11 months ago

Re: Methods Do we remove it from the README because it's already captured in the metadata?

ahamelers commented 11 months ago

Re: Methods Do we remove it from the README because it's already captured in the metadata?

As far as I am aware, currently the README template does not ask for Methods.

(If you're asking if curators should remove it if someone puts it in there, my vote is not to worry about it. Do I get a vote?)

jleighherzog commented 10 months ago

@bryanmgee @lauradryad @ryscher @ahamelers

Met with Bryan and Laura; we have the following suggestions/requests:

1) Keep template as-is. We'll revisit based on feedback/HD messages received.
2) Keep Methods section on dataset landing page; good to have for those submissions that don't have scholarly outputs.
3) Possible to auto-transfer the title into the template? Often we ask authors to update their titles because they are insufficient (e.g. "Raw Data"). If it's not auto-updated or generated, it's another check for the curators to perform.
4) Possible to auto-populate and include DOI and citation in the README file?
5) Q: Remove README from landing page. We anticipate there will be formatting issues and readability concerns; curators will need to heavily copyedit the README text because it will be displayed front and center on the landing page. Sometimes these README files can be a giant wall of text that will bury the other metadata fields (Funding, etc.).
6) Q: Should there be a "Beta" label and/or feedback button for this feature? A link?

Will these blue boxes be rolled out too? Will this box disappear upon publication of the dataset? [Screenshot 2023-08-25 at 6 47 11 PM]

jleighherzog commented 10 months ago

Example of how a README might display on the landing page:

README.md

This README file is for the data published in "An adaptive biomolecular condensation response is conserved across environmentally divergent species" by Keyport Kik et al. (2023) (preprint: https://www.biorxiv.org/content/10.1101/2023.07.28.551061v1). These folders also contain custom R (version 4.2.2) scripts for data processing, analysis, and figure generation, organized into separate experimental folders. Some data have a separate R code for data processing and exporting data. The ordering of the figures may be different from what's shown in the paper. The code used to process raw RNA-seq data can be found on Github (https://github.com/skeyport/conservation-of-condensation-2023). The rest of the data and scripts can be found on DataDryad (https://doi.org/10.5061/dryad.w3r2280w6). Below is a description of the contents of each folder. Raw RNA-seq data can be found under GEO accession code GSE234499.


supp-files/

Within this folder are three supplemental files:
suppfile1/ (master-orthologs.txt) - contains ortholog calls for all three species, including gene and ORF name from S. cerevisiae
suppfile2/ (230215_labeled_genes_scer_all.tsv) - contains gene annotations derived from the following sources. The targets of HSF1 and Msn2/4 were curated from Pincus et al. 2018 and Solís et al. 2016. The genes for core ribosomal proteins, ribosome biogenesis factors, and glycolytic enzymes (superpathway of glucose fermentation) as well as transcription factor regulation assignments were derived from the Saccharomyces Genome Database (Cherry et al. 2012; Engel et al. 2014, https://www.yeastgenome.org/). Genes for translation factors were derived from the KEGG BRITE database (Kanehisa et al. 2016).
suppfile3/ (20-04-21_sgd-all-regulators2.tsv) - Transcription factor regulators were assigned according to Triandafillou et al. 2020.


fig-svgs/

Final Figures 1-5 as well as Supplemental Figures 1-3 (in *.svg format) can be found in this folder. Table 1 and Table 2 can also be found here as *.txt files.


dryad-upload/

Seven folders contain raw and processed data as well as R or Python scripts (open source versions) to produce the figures in the paper. All required packages and dependencies are listed at the top of each script. Both Windows v11 and MacOS can run all scripts. After downloading the .zip file, data are organized to be able to run efficiently within the structured directories in which they are found. Expected (actual) outputs are included for each script. Running all of the scripts and producing the figures should require less than a day's time. All input files are contained in this folder to test the code. Instructions for use are below. If there is a specific order to run the code, it is outlined below (e.g., condensation-ms/); in all other cases, scripts can be run in any order.

condensation-ms/ - mass spectrometry was performed at different control and treatment temperatures for each species. Outputs from Scaffold DIA 3.3.1 analysis performed by MS Bioworks are in the data/ folder (MSB-9658A U. Chicago Keyport 042622.txt - text file of S. kudriavzevii processed data from MS Bioworks, MSB-9658B U. Chicago Keyport 042722.txt - text file of S. cerevisiae processed data from MS Bioworks, MSB-10835 U. Chicago Keyport 022223.txt - text file of K. marxianus processed data from MS Bioworks). Other sample name and proteome information called by the scripts are contained in data/, including conditions.txt (contains sample conditions), sample_names.txt (experimental names and associated data), and kmarx-tsp-by-condition.txt (tidy version of raw intensity data for K. marxianus). All analyses are performed in RStudio (R version 4.2.2). Analysis script process-raw-dia.Rmd processes raw data for S. cerevisiae and S. kudriavzevii and uses raw MS data, sample_names.txt, uniprot-gene-orf.txt (accession and gene information from Uniprot), and skud_proteome_ygob.fasta (*.fasta file for S. kudriavzevii from YGOB). This script produces results/processed_data.tsv, which is an input for mixing ratio calculation in calculate-mixing-dia.Rmd along with data/kmarx-tsp-by-condition.txt and data/scer-kmarx-skud-orthologs.txt (curated ortholog list for each species); this script outputs two files (results/230501_mixing_ratios.tsv and results/psups-three-species.txt). psup-three-species.txt was transformed into psup_wide_data.txt which is then used in conservation-of-condensation.Rmd, in combination with data/scer-anno-proteins.txt (annotations of protein processes as described in the Methods section, "Gene Annotation") and 230501_mixing_ratios.tsv. This script, conservation-of-condensation.Rmd, ultimately produces the figures (found in figures/). Final figures can be found in the fig-svg/ directory.
A subset of custom functions called in scripts are defined in utilityFunctions.R.

dls/ - raw measurements from DLS temperature ramp experiments are contained in directories with dates as names as *.csv files, and associated sample information is also contained as *-samples.csv. All analyses are performed in RStudio (R version 4.2.2). Analysis script analyze-dls.Rmd reads in sample information and raw data to produce Figures 4a-d (output in figs/ folder). Raw data from Riback & Katanski et al. 2017 can be found in riback-2017/ directory (WT Pab1 in Pab1_15uM_DLSbuffpH6p4.csv, MV to A Pab1 in Pab1_MVtoA_15uM_pH6p4_4_14_15_1.csv, and MV to I Pab1 in Pab1_MVtoI_15uM_pH6p4.csv), and is also used in analyze-dls.Rmd. Growth data to produce Figure 4b are generated from scripts in growth-curves/ and output into this directory, including conf-int-topt.txt (confidence intervals from estimation of optimal temperature in Figure 1b), hsr-growth-max.txt (maximum estimated growth temperature and heat shock temperature for each species), and stop-growing.txt (estimations of the temperatures at which each species stops growing). Table S1 (table1.txt) and Table S2 (table2.txt) are output from the analyze-dls.Rmd script and contain baseline size calculations for each protein (S1) and temperature and doubling size of condensation for each protein (S2). Final figures can be found in the fig-svg/ directory.

flow-cytometry/ - raw *.fcs files of each species +/- heat shock at various temperatures with 180 min of recovery are found here with condition descriptors in file names and contained directories with dates as names. All analyses are performed in RStudio (R version 4.2.2), and require custom but publicly available R packages. These custom packages can be found on GitHub (https://github.com/ctriandafillou/flownalysis; https://github.com/ctriandafillou/cat.extras). The script 20220414-skud-scer-manytemps-endpoint.Rmd reads in *.fcs files, processes flow data, and produces Figure S2a (found in output/results/). Another output file, max-fc-hsr.txt, which contains the fold change and statistics of maximum heat shock response for each species at each temperature, is produced here and also exported to growth-curves/data/ which will in part produce Figure 1c-d (also found in output/results/). Final figures can be found in the fig-svg/ directory.

growth-curves/ - raw OD600 data for each WT species is found in 20230311/20230311.txt and was sampled from log-phase growing yeast grown at different temperatures. All analyses are performed in RStudio (R version 4.2.2). Two R scripts are contained in this directory. First, growth-curve-estimation.R calculates the maximum specific growth rates by estimating the slope of the linear range of growth for each species and temperature. Resulting growth curves are then fit using the cardinal temperature model with inflection (Rosso, Lobry, and Flandrois 1993), and temperature of maximum growth is estimated from these curves and output into data/dat-max.txt. Optimal temperature statistics are calculated as confidence interval (results/conf-int-topt.txt, also output into ../dls/) and standard deviation (results/topt-sd.txt). This script is also used to estimate the temperature at which each species stops growing, and is output as data/stop-growing.txt (also output into ../dls/). growth-curve-estimation.R generates Figure 1b (found in results/). The second script, plot-growth-hsr-correlation.Rmd, takes the output files from the first script as well as data/max-fc-hsr.txt produced from ../flow-cytometry/20220414-skud-scer-manytemps-endpoint.Rmd to produce Figures 1c and 1d as well as results/hsr-growth-max.txt, which contains temperature optima for growth and heat shock responses for each species. Final figures can be found in the fig-svg/ directory.

hdx/ - HDX-MS data were collected for each species' Pab1 before and after condensation. Raw data for each species as *_peptideUptake.csv, as well as manually curated domain boundaries (20221214_pab1DomainBoundaries) and secondary structure predictions (domain-ss.txt) from Schäfer et al. 2019 are found in this folder. Python3 scripts (extract-hdx.py, hdx.py, and hdx_test.py) produce %D values for each residue in aligned Pab1 using aligned-positions.txt based on the alignment (align-pab1.txt) that are used to compute means across the sequences. The outputs for Python3 scripts are in output/*hdx.txt. Figure 5a-f generation (each found in output/) and other data processing are done in RStudio (R version 4.2.2) in 20230328_HDX_Pab1_SKK.Rmd. Final figures can be found in the fig-svg/ directory.

rna-seq/ - Each species was treated with a species-specific heat shock (or control temperature) and then submitted for RNA-Seq. Upstream pre-processing was performed with a custom Snakemake pipeline which can be found on Github (https://github.com/skeyport/conservation-of-condensation-2023). Downstream processing and analyses as well as figure generation are performed in RStudio (R version 4.2.2). The script 230503_analyze_counts.Rmd aggregates count data (counts/*_counts.tsv) and sample information (20230309-sample-info.txt, also in data/) to produce output/20230309-counts.tsv. Gene lengths were extracted for each gene by first adding exon annotations to the GTF files (*_genomic.gtf or *_genomic.cleaned.gtf) using a custom script based on gffutils v0.11.1 (https://github.com/daler/gffutils, also in this directory as gffutils_fix_missing_exon.py). Gene lengths were then calculated using the GenomicFeatures package in R. These lengths (found in output/*_merged_exon_length.tsv) are then used to calculate transcript per million values (TPMs, output/20230309_TPM.tsv, also in data/) using the counts output (output/20230309-counts.tsv). Fold changes in transcript abundance were calculated using DESeq2 v3.16 (output/20230309-deseq.tsv). We used pre-published genome annotations src/label_Scer_genes/Saccharomyces-kudriavzevii-ZP591_genome.tab from YGOB v8 (beta) to match RNA-Seq data to S. kudriavzevii strain ZP591. Figures 2a-2d and Figures S3a-c are produced using 230503_analyze_counts.Rmd. Gene annotations were assigned in src/label_Scer_genes/230215-label-genes.Rmd, where targets of HSF1 and Msn2/4 were curated from Solis et al., 2016 and Pincus et al., 2018 (Solis_2016/mmc3.xlsx and 200325-scer-features.txt (also in data/)). 
The genes for core ribosomal proteins (scer-ribosomal-proteins.txt), ribosome biogenesis factors (ribosome_biogenesis_annotations_sgd.txt), and glycolytic enzymes (superpathway of glucose fermentation, SGD_superpathway_glycolysis_221025.txt) as well as transcription factor regulation assignments (20-04-21_sgd-all-regulators2.tsv and tfs_and_targets_heatshock.tsv) were derived from the Saccharomyces Genome Database (https://www.yeastgenome.org/). Genes for translation factors were derived from the KEGG BRITE database (translation_factors_kegg.tsv). The gene annotation file was output into 230215_labeled_genes_scer_all.tsv, which is subsequently used in 230503_analyze_counts.Rmd to assign gene annotations. The script to produce the upset plots from Figures S2b-c is txn-comparison.Rmd, and also takes data from Brion et al. 2016 (data/brion2016-lkluyveri-stress-seq.txt), sample information (/data/20230309-sample-info.txt), orthologs (data/master-orthologs.txt), TPMs (data/20230309_TPM.tsv), and gene annotations (/data/230508_labeled_genes_scer.tsv). GEO upload TPMs and counts (Keyport-Kik_2023_TPMs.tsv and Keyport-Kik_2023_counts.tsv) are found in output/ as well, and are identical to output/20230309_*.txt. A subset of custom functions called in scripts are defined in utilityFunctions.R. A file containing orthologs for all species (also Supplemental File 1), master-orthologs.txt, is found in data/ and src/label_Scer_genes/ and is used as an input in the script 230503_analyze_counts.Rmd. Final figures can be found in the fig-svg/ directory.

spot-assays/ - Spot assays and plate growth assays were performed for each species at corresponding temperatures. Raw *.tif images for Figure 4e and Figure S1a-c (organized as directories) are contained here. Naming conventions for images in fig4e/ and figs1e/ are [date][time][strain/species][duration][temperature]. Naming conventions for images in figs1a/ are [plate][duration][temp][species]. Naming conventions for images in figs1c/ are [date][time][temperature][duration]. Final figures can be found in the fig-svg/ directory.

ahamelers commented 10 months ago
  3. Possible to auto-transfer the title into the template? Often we ask authors to update their titles because they are insufficient (e.g. "Raw Data"). If it's not auto-updated or generated, it's another check for the curators to perform.

  4. Possible to auto-populate and include DOI and citation in the README file?

Yes, both of these are possible.
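
For anyone curious what the pre-filling could look like, here is a minimal sketch in Python. It is not how Dryad implements it: the function name `prefill_readme`, the heading text, and the section layout are all assumptions made up for illustration. It only shows the idea of carrying the title, DOI link, and citation over from metadata the author has already entered.

```python
from typing import Optional


def prefill_readme(title: str, doi: str, citation: Optional[str] = None) -> str:
    """Build starter README text from metadata the author already entered.

    Hypothetical helper: the layout below is an assumption, not the
    actual Dryad README template.
    """
    doi_url = f"https://doi.org/{doi}"
    lines = [f"# {title.strip()}", "", doi_url]
    if citation:
        # Only add a citation section when a citation is available.
        lines += ["", "## Citation", "", citation.strip()]
    # Leave an empty section for the author to describe the files.
    lines += ["", "## Description of the data and file structure", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(prefill_readme("Raw Data", "10.5061/dryad.w3r2280w6"))
```

Because the title is read from the metadata at generation time, a title that curators ask the author to improve would only need to be regenerated, not rechecked by hand.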

  5. Q: Remove README from landing page. We anticipate there will be formatting issues and readability concerns; curators will need to heavily copyedit the README text because it will be displayed front and center on the landing page. Sometimes these README files can be a giant wall of text that will bury the other metadata fields (Funding, etc.).

We discussed this a lot in the curator meeting today. The README is meant to be the main file representing the dataset regardless of how it is accessed. We did not know the level of copy-editing you have already been doing, and did not intend for the README to need to be copy-edited, but it has been intended to be in the landing page for some time. For now, we will leave the README on the landing page, but restrict the size of what is shown, and have a “view more” button.
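
As a rough illustration of the size restriction, a sketch in Python follows. The function name `readme_preview` and the 1000-character default are assumptions for the example, not the actual landing-page code. It cuts the text at the last paragraph break before the limit and reports whether a "view more" control would be needed.

```python
from typing import Tuple


def readme_preview(text: str, max_chars: int = 1000) -> Tuple[str, bool]:
    """Return (preview, truncated) for showing a README on a landing page.

    Hypothetical helper: prefers cutting at a paragraph break, then at a
    space, and only cuts mid-word when neither exists before the limit.
    """
    if len(text) <= max_chars:
        return text, False
    head = text[:max_chars]
    cut = head.rfind("\n\n")      # last paragraph break within the limit
    if cut <= 0:
        cut = head.rfind(" ")     # fall back to the last space
    if cut <= 0:
        cut = max_chars           # no break found; hard cut
    return head[:cut].rstrip(), True


if __name__ == "__main__":
    preview, truncated = readme_preview("para one\n\npara two is long", 15)
    print(preview)
    print("show 'view more':", truncated)
```

Cutting at a paragraph boundary keeps the visible part readable even when the full README is a single long block further down.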

  6. Q: Should there be a "Beta" label and/or feedback button for this feature? A link?

I thought I could answer this, but now I'm not 100% sure which feature you mean. For the README generation as part of the submission, that should feature heavily in the submission feedback survey we are developing (#2699).

Will these blue boxes be rolled out too? Will this box disappear upon publication of the dataset? [Screenshot 2023-08-25 at 6 47 11 PM]

Yes and yes—these are just an altered design of the existing citation and sharing link sections, so they take up less space, but draw the eye a little more, and have a built-in copying function.