Reed-CompBio / spras

Signaling Pathway Reconstruction Analysis Streamliner (SPRAS)
MIT License
11 stars 20 forks source link

Contribution guide improvements #84

Closed agitter closed 1 month ago

agitter commented 1 year ago

This issue is to collect challenges and confusion that new team members face as they follow the steps in the contribution guide so that we can improve the document.

One major issue is that "Step 3: Write the Local Neighborhood wrapper functions" is a huge leap from the previous steps. More scaffolding or smaller sub-steps are needed to build toward full implementations of generate_inputs, run_container, and parse_output. For instance, that could involve interactive inspecting the data structures available through the Dataset class and running pandas queries with them before implementing generate_inputs.

ellango2612 commented 1 year ago

One minor thing that I believe could improve readability with the guide is to omit "prize" and replace that with "weight" when mentioning the distance between the 2 groups in step 3:

"Implement the generate_inputs function, following the omicsintegrator1.py example. The nodes should be any node in the dataset that has a prize set, any node that is a source, or any node that is a target. The network should be all of the edges written in the format |. src/dataset.py provides functions that provide access to node information and the interactome (edge list)."

this will make the wording more understandable, or at least the distance being called "weight" is more intuitive for me.

agitter commented 1 year ago

One minor thing that I believe could improve readability with the guide is to omit "prize" and replace that with "weight" when mentioning the distance between the 2 groups in step 3:

This is a good suggestion, though it would involve changing more than the contribution guide. The "prize" jargon comes from the Steiner forest algorithm. We may want to switch to node "scores" or "weights" to be more general.

Right now "prize" is a keyword that has special meaning in our input files so this change would involve changing the input files and generate_inputs functions of a few pathway reconstruction algorithms.

annaritz commented 1 year ago

This is a good suggestion, though it would involve changing more than the contribution guide. The "prize" jargon comes from the Steiner forest algorithm. We may want to switch to node "scores" or "weights" to be more general.

Perhaps we clarify what we mean by prizes. The local neighborhood alg doesn't need any prizes, so in the contributing guide it's a little unclear.

annaritz commented 1 year ago

Thoughts from 5/25 meeting:

agitter commented 1 year ago

should we assume that every algorithm should run on every dataset?

My original idea was that we would try to run algorithms on as many datasets as possible. However, as you point out approximations would need to be made. Currently, these are hidden deeply in the source code, which is bad. The idea of #26 was to have nice user-friendly documentation that presents those assumptions and approximations more transparently, for instance, what an algorithm that expects unweighted graphs would do when it is provided a weighted graph.

agitter commented 1 year ago

Noting the missed filename change from https://github.com/Reed-CompBio/spras/pull/87#issue-1736749517

In Step 4, instead of "Import the new class LocalNeighborhood in PRRunner.py", it should be "... LocalNeighborhood in src/runner.py".

agitter commented 1 year ago

Livvy got stuck during SPRAS setup because the Carpentries Anaconda installation video recommend doesn't describe the required conda init bash step. That's not part of this contribution guide, but could be added to our main readme.

Lyce24 commented 1 year ago

During integrating the random-walk-with-restart algorithm into SPRAS, I have identified some points that could be helpful in improving the CONTRIBUTING guide:

Overall, I found the CONTRIBUTING guide to be an informative and clear introduction, providing step-by-step procedures for contributing my own algorithm.

agitter commented 1 year ago

@Lyce24 what do you think about the idea of keeping the current step-by-step guide as a literal set of instructions to follow for adding Local Neighborhood to SPRAS and then adding a new section with more general considerations for adding additional methods to SPRAS? That new section would cover some of the issues you ran into above about requirements, more complex Dockerfiles, implementing new algorithms in separate repositories, documentation, etc.

We do need more of an introduction to pandas in the Local Neighborhood steps. It isn't possible to implement generate_inputs without pandas knowledge. I often have a separate pandas tutorial notebook that I have students in my group work through before starting the Local Neighborhood guide. Something small like that would be a good prerequisite to help build up familiarity with pandas.

Lyce24 commented 1 year ago

what do you think about the idea of keeping the current step-by-step guide as a literal set of instructions to follow for adding Local Neighborhood to SPRAS and then adding a new section with more general considerations for adding additional methods to SPRAS? That new section would cover some of the issues you ran into above about requirements, more complex Dockerfiles, implementing new algorithms in separate repositories, documentation, etc.

I agree, it is a good idea. Our ultimate goal is to enable people to contribute to SPRAS, and implementing LocalNeighborhood provides an opportunity to gain a comprehensive understanding of the project's structure. This understanding is crucial for future development efforts. Introducing a new section that offers additional details and hints about the issues encountered during implementation will be helpful. Overall, I believe it is a sensible approach that allows us to enhance the project's documentation and make it more accessible to potential contributors.

We do need more of an introduction to pandas in the Local Neighborhood steps. It isn't possible to implement generate_inputs without pandas knowledge. I often have a separate pandas tutorial notebook that I have students in my group work through before starting the Local Neighborhood guide. Something small like that would be a good prerequisite to help build up familiarity with pandas.

And for this part, we (people in Reed) will find some tutorials about Pandas to go through this week too in case there is further usage of it.

agitter commented 1 year ago

will find some tutorials about Pandas

The simple notebook I used to use to introduce pandas is in the repo https://github.com/gitter-lab/network-notebook-examples. Specifically exercises/explore_class_data.ipynb. However, it uses an outdated version of pandas.

There are also some links to other pandas guides, though some are fairly old.

agitter commented 1 year ago

I propose creating a new Step 1 that recommends creating a new fork of the repository and a new branch in that fork. Setting that up can be tricky and we can give instructions.

agitter commented 1 year ago

To help with generate_inputs in Step 3, we can add instructions for first learning about the Dataset class in SPRAS outside of the Snakemake workflow. Here is a way to directly load one of the example datasets in our config file and then inspect the pandas dataframes for the node and edge tables.

>>> from src.dataset import Dataset
>>> dataset_dict = {'label': 'data0', 'node_files': ['node-prizes.txt', 'sources.txt', 'targets.txt'], 'edge_files': ['network.txt'], 'other_files': [], 'data_dir': 'input'}
>>> data = Dataset(dataset_dict)
>>> data.node_table.head()
  NODEID  prize sources targets
0      A    2.0    True     NaN
1      B    NaN     NaN     NaN
2      C    5.7     NaN    True
>>> data.interactome.head()
  Interactor1 Interactor2  Weight
0           A           B    0.98
1           B           C    0.77

It also gives a way to test the functions a dataset provides.

>>> data.request_node_columns(['sources'])
  sources NODEID
0    True      A
ellango2612 commented 1 year ago

This makes sense to me! I think it would be a very helpful change.

agitter commented 1 year ago

Workflows are not automatically enabled and must be turned on manually.

annaritz commented 3 months ago

It's that time of year again - use this GitHub issue to note any challenges you have when going through the Contributing Guide. Thanks!

gabeah commented 3 months ago

It seems some things are out of date. Specifically file paths containing "/src/" should now contain "/spras/".

Also, I kept getting errors from docker when trying to build the SNAKEFILE, after doing some diagnostics, it looks like the problem is in the config.yaml file on line 17. Declaring an owner in the registry does not allow for builds from multiple owners.

sumedhars commented 3 months ago

Some things I faced issues with for Local Neighborhood:

agitter commented 3 months ago

f5b880bd7c396a4a099c46563b9a840a4bb8f4b3 in our header line pull request addresses the parse output instructions