Closed ahamelers closed 10 months ago
@bryanmgee @jleighherzog @lauradryad
Good notes from Scott on the README pull request
Thank you for doing all this, @ahamelers .
It looks really beautiful and is so much better than our old process and you've put a ton of great work into this.
Filling in the extra README made me wonder how annoyed I would be if I was an author and I needed to re-fill the title (maybe we could pre-fill if no readme exists yet) and also the related works.
I also found myself questioning inclusion of all the related works. I found myself thinking that maybe we should put a link to the DOI or Dryad URL in the readme so someone could get all the rich and up-to-date metadata on the live site in case they lost track of where they got the data files from but I wasn't sure it needed repeating in there.
I also wasn't sure how much some of the stuff overlaps with what they've already filled. Like I can see people just putting in the abstract for "give a brief summary" section (whether that is meant to be that way or not).
I think these are all metadata questions and maybe I'm not the right person to answer them so I might be completely off base. However, as a user if I interpret the readme as filling in information I've already entered and I have to retype it all again, then I'd be annoyed.
My two cents:
Usage Notes: this would be useful, but it's used quite variably and sometimes could slot into the Code/Software section of the README and other times would not line up
Based on the UX testing feedback, we're getting rid of the separate Usage notes now that the README will be required (for submissions after this release it will only appear if someone has already put stuff there before the release). The thinking was, based on the instructions for the Usage notes and the README, it is content they are meant to put in the README and is entirely duplicative. Let me know if you disagree—if there is some reason for a separate Usage notes it will need to be really differentiated for our confused users.
No, I like that, we get a lot of garbage in the Usage Notes right now
Re: Methods Do we remove it from the README because it's already captured in the metadata?
As far as I am aware, currently the README template does not ask for Methods.
(If you're asking if curators should remove it if someone puts it in there, my vote is not to worry about it. Do I get a vote?)
@bryanmgee @lauradryad @ryscher @ahamelers
Met with Bryan and Laura; we have the following suggestions/requests:
1) Keep template as-is. We'll revisit based on feedback/HD messages received.
2) Keep Methods section on dataset landing page; good to have for those submissions that don't have scholarly outputs.
3) Possible to auto-transfer the title into the template? Often we ask authors to update their titles because they are insufficient (e.g. "Raw Data"). If it's not auto-updated or generated, it's another check for the curators to perform.
4) Possible to auto-populate and include DOI and citation in the README file?
5) Q: Remove README from landing page. We anticipate there will be formatting issues and readability concerns; curators will need to heavily copyedit the README text because it will be displayed front and center on the landing page. Sometimes these README files can be a giant wall of text that will bury the other metadata fields (Funding, etc.).
6) Q: Should there be a "Beta" label and/or feedback button for this feature? A link?
Will these blue boxes be rolled out too? Will this box disappear upon publication of the dataset?
Example of how a README might display on the landing page:
This README file is for the data published in "An adaptive biomolecular condensation response is conserved across environmentally divergent species" by Keyport Kik et al. (2023) (preprint: https://www.biorxiv.org/content/10.1101/2023.07.28.551061v1). These folders also contain custom R (version 4.2.2) scripts for data processing, analysis, and figure generation, organized into separate experimental folders. Some data have a separate R code for data processing and exporting data. The ordering of the figures may be different from what's shown in the paper. The code used to process raw RNA-seq data can be found on Github (https://github.com/skeyport/conservation-of-condensation-2023). The rest of the data and scripts can be found on DataDryad (https://doi.org/10.5061/dryad.w3r2280w6). Below is a description of the contents of each folder. Raw RNA-seq data can be found under GEO accession code GSE234499.
supp-files/
Within this folder are three supplemental files:
suppfile1/ (master-orthologs.txt) - contains ortholog calls for all three species, including gene and ORF name from S. cerevisiae
suppfile2/ (230215_labeled_genes_scer_all.tsv) - gene annotations were derived from the following sources. The targets of HSF1 and Msn2/4 were curated from Pincus et al. 2018 and Solís et al. 2016. The genes for core ribosomal proteins, ribosome biogenesis factors, and glycolytic enzymes (superpathway of glucose fermentation) as well as transcription factor regulation assignments were derived from the Saccharomyces Genome Database (Cherry et al. 2012; Engel et al. 2014, https://www.yeastgenome.org/). Genes for translation factors were derived from the KEGG BRITE database (Kanehisa et al. 2016).
suppfile3/ (20-04-21_sgd-all-regulators2.tsv) - Transcription factor regulators were assigned according to (Triandafillou et al. 2020).
fig-svgs/
Final Figures 1-5 as well as Supplemental Figures 1-3 (in *.svg format) can be found in this folder. Table 1 and Table 2 can also be found here as *.txt files.
dryad-upload/
Seven folders contain raw and processed data as well as R or Python scripts (open source versions) to produce the figures in the paper. All required packages and dependencies are listed at the top of each script. Both Windows v11 and MacOS can run all scripts. After downloading the .zip file, data are organized to be able to run efficiently within the structured directories in which they are found. Expected (actual) outputs are included for each script. To run all of the scripts and produce the figures should require less than a day's time. All input files are contained in this folder to test the code. Instructions for use are below. If there is a specific order to run the code, it is outlined below (e.g., condensation-ms/); in all other cases, scripts can be run in any order.
condensation-ms/ - mass spectrometry was performed at different control and treatment temperatures for each species. Outputs from Scaffold DIA 3.3.1 analysis performed by MS Bioworks are in the data/ folder (MSB-9658A U. Chicago Keyport 042622.txt - text file of S. kudriavzevii processed data from MS Bioworks, MSB-9658B U. Chicago Keyport 042722.txt - text file of S. cerevisiae processed data from MS Bioworks, MSB-10835 U. Chicago Keyport 022223.txt - text file of K. marxianus processed data from MS Bioworks). Other sample name and proteome information called by the scripts are contained in data/, including conditions.txt (contains sample conditions), sample_names.txt (experimental names and associated data), kmarx-tsp-by-condition.txt (tidy version of raw intensity data for K. marxianus). All analyses are performed in RStudio (R version 4.2.2). Analysis script process-raw-dia.Rmd processes raw data for S. cerevisiae and S. kudriavzevii and uses raw MS data, sample_names.txt, uniprot-gene-orf.txt (accession and gene information from Uniprot), and skud_proteome_ygob.fasta (*.fasta file for S. kudriavzevii from YGOB). This script produces results/processed_data.tsv, which is an input for mixing ratio calculation in calculate-mixing-dia.Rmd along with data/kmarx-tsp-by-condition.txt and data/scer-kmarx-skud-orthologs.txt (curated ortholog list for each species); this script outputs two files (results/230501_mixing_ratios.tsv and results/psups-three-species.txt). psup-three-species.txt was transformed into psup_wide_data.txt which is then used in conservation-of-condensation.Rmd, in combination with data/scer-anno-proteins.txt (annotations of protein processes as described in the Methods section, "Gene Annotation") and 230501_mixing_ratios.tsv. This script, conservation-of-condensation.Rmd, ultimately produces the figures (found in figures/). Final figures can be found in the fig-svg/ directory. A subset of custom functions called in scripts are defined in utilityFunctions.R.
dls/ - raw measurements from DLS temperature ramp experiments are contained in directories with dates as names as *.csv files, and associated sample information is also contained as *-samples.csv. All analyses are performed in RStudio (R version 4.2.2). Analysis script analyze-dls.Rmd reads in sample information and raw data to produce Figures 4a-d (output in figs/ folder). Raw data from Riback & Katanski et al. 2017 can be found in riback-2017/ directory (WT Pab1 in Pab1_15uM_DLSbuffpH6p4.csv, MV to A Pab1 in Pab1_MVtoA_15uM_pH6p4_4_14_15_1.csv, and MV to I Pab1 in Pab1_MVtoI_15uM_pH6p4.csv), and is also used in analyze-dls.Rmd. Growth data to produce Figure 4b are generated from scripts in growth-curves/ and output into this directory, including conf-int-topt.txt (confidence intervals from estimation of optimal temperature in Figure 1b), hsr-growth-max.txt (maximum estimated growth temperature and heat shock temperature for each species), and stop-growing.txt (estimations of the temperatures at which each species stops growing). Table S1 (table1.txt) and Table S2 (table2.txt) are output from the analyze-dls.Rmd script and contain baseline size calculations for each protein (S1) and temperature and doubling size of condensation for each protein (S2). Final figures can be found in the fig-svg/ directory.
flow-cytometry/ - raw *.fcs files of each species +/- heat shock at various temperatures with 180 min of recovery are found here with condition descriptors in file names and contained in directories with dates as names. All analyses are performed in RStudio (R version 4.2.2), and require custom but publicly available R packages. These custom packages can be found on GitHub (https://github.com/ctriandafillou/flownalysis; https://github.com/ctriandafillou/cat.extras). The script 20220414-skud-scer-manytemps-endpoint.Rmd reads in *.fcs files, processes flow data, and produces Figure S2a (found in output/results/). Another output file, max-fc-hsr.txt, which contains the fold change and statistics of maximum heat shock response for each species at each temperature, is produced here and also exported to growth-curves/data/ which will in part produce Figure 1c-d (also found in output/results/). Final figures can be found in the fig-svg/ directory.
growth-curves/ - raw OD600 data for each WT species is found in 20230311/20230311.txt and was sampled from log-phase growing yeast grown at different temperatures. All analyses are performed in RStudio (R version 4.2.2). Two R scripts are contained in this directory. First, growth-curve-estimation.R calculates the maximum specific growth rates by estimating the slope of the linear range of growth for each species and temperature. Resulting growth curves are then fit using the cardinal temperature model with inflection (Rosso, Lobry, and Flandrois 1993), and temperature of maximum growth is estimated from these curves and output into data/dat-max.txt. Optimal temperature statistics are calculated as confidence interval (results/conf-int-topt.txt, also output into ../dls/) and standard deviation (results/topt-sd.txt). This script is also used to estimate the temperature at which each species stops growing, and is output as data/stop-growing.txt (also output into ../dls/). growth-curve-estimation.R generates Figure 1b (found in results/). The second script, plot-growth-hsr-correlation.Rmd, takes the output files from the first script as well as data/max-fc-hsr.txt produced from ../flow-cytometry/20220414-skud-scer-manytemps-endpoint.Rmd to produce Figures 1c and 1d as well as results/hsr-growth-max.txt, which contains temperature optima for growth and heat shock responses for each species. Final figures can be found in the fig-svg/ directory.
hdx/ - HDX-MS data were collected for each species' Pab1 before and after condensation. Raw data for each species as *_peptideUptake.csv, as well as manually curated domain boundaries (20221214_pab1DomainBoundaries) and secondary structure predictions (domain-ss.txt) from Schäfer et al. 2019 are found in this folder. Python3 scripts (extract-hdx.py, hdx.py, and hdx_test.py) produce %D values for each residue in aligned Pab1 using aligned-positions.txt based on the alignment (align-pab1.txt) that are used to compute means across the sequences. The outputs for Python3 scripts are in output/*hdx.txt. Figure 5a-f generation (each found in output/) and other data processing are done in RStudio (R version 4.2.2) in 20230328_HDX_Pab1_SKK.Rmd. Final figures can be found in the fig-svg/ directory.
rna-seq/ - Each species was treated with a species-specific heat shock (or control temperature) and then submitted for RNA-Seq. Upstream pre-processing was performed with a custom Snakemake pipeline which can be found on Github (https://github.com/skeyport/conservation-of-condensation-2023). Downstream processing and analyses as well as figure generation are performed in RStudio (R version 4.2.2). The script 230503_analyze_counts.Rmd aggregates count data (counts/*_counts.tsv) and sample information (20230309-sample-info.txt, also in data/) to produce output/20230309-counts.tsv. Gene lengths were extracted for each gene by first adding exon annotations to the GTF files (*_genomic.gtf or *_genomic.cleaned.gtf) using a custom script based on gffutils v0.11.1 (https://github.com/daler/gffutils, also in this directory as gffutils_fix_missing_exon.py). Gene lengths were then calculated using the GenomicFeatures package in R. These lengths (found in output/*_merged_exon_length.tsv) are then used to calculate transcript per million values (TPMs, output/20230309_TPM.tsv, also in data/) using the counts output (output/20230309-counts.tsv). Fold changes in transcript abundance were calculated using DESeq2 v3.16 (output/20230309-deseq.tsv). We used pre-published genome annotations src/label_Scer_genes/Saccharomyces-kudriavzevii-ZP591_genome.tab from YGOB v8 (beta) to match RNA-Seq data to S. kudriavzevii strain ZP591. Figures 2a-2d and Figures S3a-c are produced using 230503_analyze_counts.Rmd. Gene annotations were assigned in src/label_Scer_genes/230215-label-genes.Rmd, where targets of HSF1 and Msn2/4 were curated from Solis et al., 2016 and Pincus et al., 2018 (Solis_2016/mmc3.xlsx and 200325-scer-features.txt (also in data/)). The genes for core ribosomal proteins (scer-ribosomal-proteins.txt), ribosome biogenesis factors (ribosome_biogenesis_annotations_sgd.txt), and glycolytic enzymes (superpathway of glucose fermentation, SGD_superpathway_glycolysis_221025.txt) as well as transcription factor regulation assignments (20-04-21_sgd-all-regulators2.tsv and tfs_and_targets_heatshock.tsv) were derived from the Saccharomyces Genome Database (https://www.yeastgenome.org/). Genes for translation factors were derived from the KEGG BRITE database (translation_factors_kegg.tsv). The annotation gene file was output into 230215_labeled_genes_scer_all.tsv, which is subsequently used in 230503_analyze_counts.Rmd to assign gene annotations. The script to produce the upset plots from Figures S2b-c is txn-comparison.Rmd, and also takes data from Brion et al. 2016 (data/brion2016-lkluyveri-stress-seq.txt), sample information (/data/20230309-sample-info.txt), orthologs (data/master-orthologs.txt), TPMs (data/20230309_TPM.tsv), and gene annotations (/data/230508_labeled_genes_scer.tsv). GEO upload TPMs and counts (Keyport-Kik_2023_TPMs.tsv and Keyport-Kik_2023_counts.tsv) are found in output/ as well, and are identical to output/20230309_*.txt. A subset of custom functions called in scripts are defined in utilityFunctions.R. A file containing orthologs for all species (also Supplemental File 1), master-orthologs.txt, is found in data/ and src/label_Scer_genes/ and is used as an input in the script 230503_analyze_counts.Rmd. Final figures can be found in the fig-svg/ directory.
spot-assays/ - Spot assays and plate growth assays were performed for each species at corresponding temperatures. Raw *.tif images for Figure 4e and Figure S1a-c (organized as directories) are contained here. Naming conventions for images in fig4e/ and figs1e/ are [date][time][strain/species][duration][temperature]. Naming conventions for images in figs1a/ are [plate][duration][temp][species]. Naming conventions for images in figs1c/ are [date][time][temperature][duration]. Final figures can be found in the fig-svg/ directory.
Possible to auto-transfer the title into the template? Often we ask authors to update their titles because they are insufficient (e.g. "Raw Data"). If it's not auto-updated or generated, it's another check for the curators to perform.
Possible to auto-populate and include DOI and citation in the README file?
Yes, both of these are possible.
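For illustration only, here is a rough sketch of what pre-filling could look like. It is not how Dryad implements it: the template text, the `prefill_readme` function, and the citation layout are all assumptions made up for this example. It also follows the earlier suggestion to pre-fill only when no README exists yet.

```python
from string import Template

# Hypothetical template text; the real README template is not reproduced here.
README_TEMPLATE = Template(
    "# $title\n\n"
    "Dataset DOI: $doi_url\n\n"
    "## Citation\n\n"
    "$citation\n\n"
    "## Description of the data and file structure\n\n"
)


def prefill_readme(title, doi, authors, year, existing_readme=None):
    """Return the author's README untouched, or a pre-filled template if none exists."""
    if existing_readme:
        # Never overwrite text the author has already written.
        return existing_readme
    doi_url = f"https://doi.org/{doi}"
    # Assumed citation layout, for illustration only.
    citation = f"{'; '.join(authors)} ({year}). {title} [Dataset]. Dryad. {doi_url}"
    return README_TEMPLATE.substitute(title=title, doi_url=doi_url, citation=citation)


if __name__ == "__main__":
    # Placeholder metadata, not a real dataset.
    print(prefill_readme("Example dataset title", "10.5061/dryad.example", ["Doe, Jane"], 2023))
```

Because the title is copied from the submission metadata at generation time, a curator-requested title change made before the README is generated would carry over. Keeping the two in sync after the author edits the README would need separate handling.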
- Q: Remove README from landing page. We anticipate there will be formatting issues and readability concerns; curators will need to heavily copyedit the README text because it will be displayed front and center on the landing page. Sometimes these README files can be a giant wall of text that will bury the other metadata fields (Funding, etc.).
We discussed this a lot in the curator meeting today. The README is meant to be the main file representing the dataset regardless of how it is accessed. We did not know the level of copy-editing you have already been doing, and did not intend for the README to need to be copy-edited, but it has been intended to be in the landing page for some time. For now, we will leave the README on the landing page, but restrict the size of what is shown, and have a “view more” button.
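As a rough sketch of the "restrict the size" idea, and not the actual implementation (which may well be done purely in the page's CSS/JavaScript), a server-side preview could cut the README at the last paragraph break before a limit. The `readme_preview` function and the 1500-character default are assumptions for this example.

```python
def readme_preview(text, max_chars=1500):
    """Return (preview, truncated); truncated=True means a "view more" control is needed."""
    if len(text) <= max_chars:
        return text, False
    # Prefer to cut at the last paragraph break before the limit.
    cut = text.rfind("\n\n", 0, max_chars)
    if cut <= 0:
        # No paragraph break: fall back to the last space before the limit.
        cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:
        # No whitespace at all: hard cut at the limit.
        cut = max_chars
    return text[:cut].rstrip(), True


if __name__ == "__main__":
    sample = "Short intro paragraph.\n\n" + "A very long folder-by-folder description. " * 100
    preview, truncated = readme_preview(sample, max_chars=200)
    print(preview)
    print("show 'view more':", truncated)
```

Cutting at a paragraph boundary keeps the preview readable even when the README is one long wall of text, like the example above.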
- Q: Should there be a "Beta" label and/or feedback button for this feature? A link?
I thought I could answer this but now I'm not 100% sure which feature you mean? For the README generation as part of the submission, that should feature heavily in the submission feedback survey we are developing (#2699).
Will these blue boxes be rolled out too? Will this box disappear upon publication of the dataset?
Yes and yes—these are just an altered design of the existing citation and sharing link sections, so they take up less space, but draw the eye a little more, and have a built-in copying function.
The curators are working on edits to the current README template (https://docs.google.com/document/d/1_Lg3xtLFwFbOI697bIuBPPLpQKqYLJ5I7j8UXVlXyIQ/edit)