Closed agitter closed 1 month ago
One minor thing that I believe could improve the readability of the guide is to replace "prize" with "weight" when mentioning the distance between the two groups in step 3:
"Implement the generate_inputs function, following the omicsintegrator1.py example. The nodes should be any node in the dataset that has a prize set, any node that is a source, or any node that is a target. The network should be all of the edges written in the format
This would make the wording more understandable; at least, calling the distance "weight" is more intuitive to me.
This is a good suggestion, though it would involve changing more than the contribution guide. The "prize" jargon comes from the Steiner forest algorithm. We may want to switch to node "scores" or "weights" to be more general.
Right now "prize" is a keyword that has special meaning in our input files, so this change would involve changing the input files and the generate_inputs functions of a few pathway reconstruction algorithms.
Perhaps we should clarify what we mean by prizes. The Local Neighborhood algorithm doesn't need any prizes, so the contributing guide is a little unclear there.
Thoughts from 5/25 meeting:
- For config.yaml, we have to decide whether to handle node files with prizes or sources/targets files, and how they map to inputs for localNeighborhood.
- Instead, suggest a small yaml file for testing? Or specify how to run individual jobs with Snakemake?
- When editing a yaml file to add localNeighborhood, it occurs to us that you need to modify ALL yaml files (for example, the egfr.yaml file also needs to be updated). Should we assume that every algorithm should run on every dataset?
My original idea was that we would try to run algorithms on as many datasets as possible. However, as you point out approximations would need to be made. Currently, these are hidden deeply in the source code, which is bad. The idea of #26 was to have nice user-friendly documentation that presents those assumptions and approximations more transparently, for instance, what an algorithm that expects unweighted graphs would do when it is provided a weighted graph.
Noting the missed filename change from https://github.com/Reed-CompBio/spras/pull/87#issue-1736749517
In Step 4, instead of "Import the new class LocalNeighborhood in PRRunner.py", it should be "... LocalNeighborhood in src/runner.py".
Livvy got stuck during SPRAS setup because the recommended Carpentries Anaconda installation video doesn't describe the required conda init bash step. That's not part of this contribution guide, but it could be added to our main readme.
While integrating the random-walk-with-restart algorithm into SPRAS, I identified some points that could help improve the CONTRIBUTING guide:
- Although LocalNeighborhood does not require any dependency packages, I faced difficulties while implementing my algorithm, which required networkx, numpy, and scipy. Adding links to tutorials on creating Dockerfiles could be useful for contributors who are unfamiliar with this process.
- docker-wrapper: this will formalize the contributing process and maintain consistency.
- Because pandas is important for SPRAS, it could be listed as a prerequisite for contributing to this project. Pandas is useful for data analysis tasks, such as multiplying adjacency matrices of graphs. With knowledge of pandas, contributors will feel more comfortable writing the generate_inputs and parse_output functions. I personally experienced some difficulties when encountering pandas for the first time.
Overall, I found the CONTRIBUTING guide to be an informative and clear introduction, providing step-by-step procedures for contributing my own algorithm.
@Lyce24 what do you think about the idea of keeping the current step-by-step guide as a literal set of instructions to follow for adding Local Neighborhood to SPRAS and then adding a new section with more general considerations for adding additional methods to SPRAS? That new section would cover some of the issues you ran into above about requirements, more complex Dockerfiles, implementing new algorithms in separate repositories, documentation, etc.
We do need more of an introduction to pandas in the Local Neighborhood steps. It isn't possible to implement generate_inputs
without pandas knowledge. I often have a separate pandas tutorial notebook that I have students in my group work through before starting the Local Neighborhood guide. Something small like that would be a good prerequisite to help build up familiarity with pandas.
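As a taste of what such a warm-up could cover, here is a small sketch of the kinds of pandas operations that come up when writing generate_inputs. The column names mirror the Dataset node table used in SPRAS, but the data here is made up:

```python
import io

import pandas as pd

# Toy node table in the style of the SPRAS Dataset node table (made-up data)
node_table = pd.DataFrame({
    "NODEID": ["A", "B", "C"],
    "prize": [2.0, None, 5.7],
    "sources": [True, None, None],
    "targets": [None, None, True],
})

# Filter rows: nodes that have a prize set
prized = node_table[node_table["prize"].notna()]
print(list(prized["NODEID"]))  # ['A', 'C']

# NaN-safe boolean selection: source nodes only
sources = node_table[node_table["sources"].eq(True)]
print(list(sources["NODEID"]))  # ['A']

# Write a single column without header or index, a common generate_inputs step
buf = io.StringIO()
prized["NODEID"].to_csv(buf, index=False, header=False)
print(buf.getvalue())  # "A\nC\n"
```

Each of these three moves (row filtering on missing values, boolean masks, and headerless output) shows up in one form or another in the wrapper functions.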
I agree, it is a good idea. Our ultimate goal is to enable people to contribute to SPRAS, and implementing LocalNeighborhood provides an opportunity to gain a comprehensive understanding of the project's structure. This understanding is crucial for future development efforts. Introducing a new section that offers additional details and hints about the issues encountered during implementation will be helpful. Overall, I believe it is a sensible approach that allows us to enhance the project's documentation and make it more accessible to potential contributors.
And for this part, we (the people at Reed) will find some pandas tutorials to go through this week too, in case there is further usage of it.
The simple notebook I used to use to introduce pandas is in the repo https://github.com/gitter-lab/network-notebook-examples, specifically exercises/explore_class_data.ipynb. However, it uses an outdated version of pandas.
There are also some links to other pandas guides, though some are fairly old.
I propose creating a new Step 1 that recommends creating a new fork of the repository and a new branch in that fork. Setting that up can be tricky and we can give instructions.
To help with generate_inputs in Step 3, we can add instructions for first learning about the Dataset class in SPRAS outside of the Snakemake workflow. Here is a way to directly load one of the example datasets in our config file and then inspect the pandas dataframes for the node and edge tables.
>>> from src.dataset import Dataset
>>> dataset_dict = {'label': 'data0', 'node_files': ['node-prizes.txt', 'sources.txt', 'targets.txt'], 'edge_files': ['network.txt'], 'other_files': [], 'data_dir': 'input'}
>>> data = Dataset(dataset_dict)
>>> data.node_table.head()
NODEID prize sources targets
0 A 2.0 True NaN
1 B NaN NaN NaN
2 C 5.7 NaN True
>>> data.interactome.head()
Interactor1 Interactor2 Weight
0 A B 0.98
1 B C 0.77
It also gives a way to test the functions a dataset provides.
>>> data.request_node_columns(['sources'])
sources NODEID
0 True A
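Building on that inspection, here is a rough sketch of how a generate_inputs implementation might use those two tables. The function name, the pipe-delimited edge format, and the return values are illustrative only, not the actual SPRAS code:

```python
import pandas as pd

def generate_inputs_sketch(node_table: pd.DataFrame, interactome: pd.DataFrame):
    """Illustrative only: select the relevant nodes and format the edges.

    Following the guide's description, nodes are any node with a prize set,
    any node that is a source, or any node that is a target.
    """
    has_prize = node_table["prize"].notna()
    is_source = node_table["sources"].eq(True)
    is_target = node_table["targets"].eq(True)
    nodes = node_table.loc[has_prize | is_source | is_target, "NODEID"]

    # Hypothetical pipe-delimited edge format: "Interactor1|Interactor2"
    edges = interactome["Interactor1"] + "|" + interactome["Interactor2"]
    return list(nodes), list(edges)
```

With the data0 tables shown above, this would select nodes A and C (B has no prize and is neither a source nor a target) and produce the edges A|B and B|C.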
This makes sense to me! I think it would be a very helpful change.
Workflows are not automatically enabled and must be turned on manually.
It's that time of year again - use this GitHub issue to note any challenges you have when going through the Contributing Guide. Thanks!
It seems some things are out of date. Specifically file paths containing "/src/" should now contain "/spras/".
Also, I kept getting errors from Docker when trying to run the Snakefile. After some diagnostics, it looks like the problem is in the config.yaml file on line 17: declaring an owner in the registry does not allow for builds from multiple owners.
Some things I faced issues with for Local Neighborhood:
- In parse_output, need to reinsert directionality.
- In generate_inputs, get rid of prize in the node_df.
- In generate_inputs, set headers to False and remove directionality and weights.
- Set reedcompbio under container_registry after setting the rest of the algorithm includes to false.
- Commit f5b880bd7c396a4a099c46563b9a840a4bb8f4b3 in our header line pull request addresses the parse_output instructions.
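For the parse_output points above, here is a hedged sketch of what reading headerless algorithm output and restoring the header and a direction column might look like. The column names, the pipe delimiter, and the "U" undirected flag are assumptions for illustration, not the actual SPRAS conventions:

```python
import io

import pandas as pd

def parse_output_sketch(raw_output: str) -> pd.DataFrame:
    """Illustrative only: read headerless, pipe-delimited algorithm output
    and restore the columns a SPRAS-style output file would need."""
    df = pd.read_csv(io.StringIO(raw_output), sep="|", header=None,
                     names=["Node1", "Node2"])
    df["Rank"] = 1           # Local Neighborhood output is unranked
    df["Direction"] = "U"    # reinsert directionality: mark edges undirected
    return df

result = parse_output_sketch("A|B\nB|C\n")
print(result.to_string(index=False))
```

The key idea is that whatever the containerized algorithm strips out (headers, direction, weights) has to be added back here before the rest of the pipeline consumes the file.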
This issue is to collect challenges and confusion that new team members face as they follow the steps in the contribution guide so that we can improve the document.
One major issue is that "Step 3: Write the Local Neighborhood wrapper functions" is a huge leap from the previous steps. More scaffolding or smaller sub-steps are needed to build toward full implementations of generate_inputs, run_container, and parse_output. For instance, that could involve interactively inspecting the data structures available through the Dataset class and running pandas queries with them before implementing generate_inputs.