Closed agitter closed 1 month ago
One minor thing that I believe could improve the readability of the guide is to replace "prize" with "weight" when mentioning the distance between the two groups in step 3:
"Implement the generate_inputs function, following the omicsintegrator1.py example. The nodes should be any node in the dataset that has a prize set, any node that is a source, or any node that is a target. The network should be all of the edges written in the format
This would make the wording more understandable; at least, calling the distance "weight" is more intuitive to me.
This is a good suggestion, though it would involve changing more than the contribution guide. The "prize" jargon comes from the Steiner forest algorithm. We may want to switch to node "scores" or "weights" to be more general.
Right now "prize" is a keyword that has special meaning in our input files, so this change would involve changing the input files and the generate_inputs functions of a few pathway reconstruction algorithms.
Perhaps we should clarify what we mean by prizes. The Local Neighborhood algorithm doesn't need any prizes, so the contributing guide is a little unclear there.
Thoughts from 5/25 meeting:
- For config.yaml, we have to decide whether to handle node files with prizes or sources/targets files, and how they map to inputs for localNeighborhood.
- Instead, suggest a small yaml file for testing? Or specify how to run individual jobs with Snakemake?
- When editing a yaml file to add localNeighborhood, it occurs to us that you need to modify ALL yaml files (for example, the egfr.yaml file also needs to be updated). Should we assume that every algorithm should run on every dataset?
My original idea was that we would try to run algorithms on as many datasets as possible. However, as you point out approximations would need to be made. Currently, these are hidden deeply in the source code, which is bad. The idea of #26 was to have nice user-friendly documentation that presents those assumptions and approximations more transparently, for instance, what an algorithm that expects unweighted graphs would do when it is provided a weighted graph.
Noting the missed filename change from https://github.com/Reed-CompBio/spras/pull/87#issue-1736749517
In Step 4, instead of "Import the new class LocalNeighborhood in PRRunner.py", it should be "... LocalNeighborhood in src/runner.py".
Livvy got stuck during SPRAS setup because the recommended Carpentries Anaconda installation video doesn't describe the required conda init bash step. That's not part of this contribution guide, but it could be added to our main readme.
While integrating the random-walk-with-restart algorithm into SPRAS, I identified some points that could help improve the CONTRIBUTING guide:
- Although LocalNeighborhood does not require any dependency packages, I faced difficulties while implementing my algorithm, which required networkx, numpy, and scipy. Adding links to tutorials on creating Dockerfiles could be useful for contributors who are unfamiliar with this process.
- docker-wrapper: this will formalize the contributing process and maintain consistency.
- Because pandas is important for SPRAS, it could be listed as a prerequisite for contributing to this project. Pandas is useful for data analysis tasks, such as multiplying adjacency matrices of graphs. With knowledge of pandas, contributors will feel more comfortable writing the generate_inputs and parse_output functions. I personally experienced some difficulties when encountering pandas for the first time.
Overall, I found the CONTRIBUTING guide to be an informative and clear introduction, providing step-by-step procedures for contributing my own algorithm.
@Lyce24 what do you think about the idea of keeping the current step-by-step guide as a literal set of instructions to follow for adding Local Neighborhood to SPRAS and then adding a new section with more general considerations for adding additional methods to SPRAS? That new section would cover some of the issues you ran into above about requirements, more complex Dockerfiles, implementing new algorithms in separate repositories, documentation, etc.
We do need more of an introduction to pandas in the Local Neighborhood steps. It isn't possible to implement generate_inputs
without pandas knowledge. I often have a separate pandas tutorial notebook that I have students in my group work through before starting the Local Neighborhood guide. Something small like that would be a good prerequisite to help build up familiarity with pandas.
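As a taste of what such a warm-up could cover, here is a small sketch of the kinds of pandas operations that come up when writing generate_inputs. The column names mirror the Dataset node table used in SPRAS, but the data here is made up:

```python
import io

import pandas as pd

# Toy node table in the style of the SPRAS Dataset node table (made-up data)
node_table = pd.DataFrame({
    "NODEID": ["A", "B", "C"],
    "prize": [2.0, None, 5.7],
    "sources": [True, None, None],
    "targets": [None, None, True],
})

# Filter rows: nodes that have a prize set
prized = node_table[node_table["prize"].notna()]
print(list(prized["NODEID"]))  # ['A', 'C']

# NaN-safe boolean selection: source nodes only
sources = node_table[node_table["sources"].eq(True)]
print(list(sources["NODEID"]))  # ['A']

# Write a single column without header or index, a common generate_inputs step
buf = io.StringIO()
prized["NODEID"].to_csv(buf, index=False, header=False)
print(buf.getvalue())  # "A\nC\n"
```

Each of these three moves (row filtering on missing values, boolean masks, and headerless output) shows up in one form or another in the wrapper functions.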
I agree, it is a good idea. Our ultimate goal is to enable people to contribute to SPRAS, and implementing LocalNeighborhood provides an opportunity to gain a comprehensive understanding of the project's structure. This understanding is crucial for future development efforts. Introducing a new section that offers additional details and hints about the issues encountered during implementation will be helpful. Overall, I believe it is a sensible approach that allows us to enhance the project's documentation and make it more accessible to potential contributors.
And for this part, we (the people at Reed) will find some pandas tutorials to go through this week too, in case there is further usage of it.
The simple notebook I used to use to introduce pandas is in the repo https://github.com/gitter-lab/network-notebook-examples, specifically exercises/explore_class_data.ipynb. However, it uses an outdated version of pandas.
There are also some links to other pandas guides, though some are fairly old.
I propose creating a new Step 1 that recommends creating a new fork of the repository and a new branch in that fork. Setting that up can be tricky and we can give instructions.
To help with generate_inputs in Step 3, we can add instructions for first learning about the Dataset class in SPRAS outside of the Snakemake workflow. Here is a way to directly load one of the example datasets in our config file and then inspect the pandas dataframes for the node and edge tables.
>>> from src.dataset import Dataset
>>> dataset_dict = {'label': 'data0', 'node_files': ['node-prizes.txt', 'sources.txt', 'targets.txt'], 'edge_files': ['network.txt'], 'other_files': [], 'data_dir': 'input'}
>>> data = Dataset(dataset_dict)
>>> data.node_table.head()
NODEID prize sources targets
0 A 2.0 True NaN
1 B NaN NaN NaN
2 C 5.7 NaN True
>>> data.interactome.head()
Interactor1 Interactor2 Weight
0 A B 0.98
1 B C 0.77
It also gives a way to test the functions a dataset provides.
>>> data.request_node_columns(['sources'])
sources NODEID
0 True A
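Building on that inspection, here is a rough sketch of how a generate_inputs implementation might use those two tables. The function name, the pipe-delimited edge format, and the return values are illustrative only, not the actual SPRAS code:

```python
import pandas as pd

def generate_inputs_sketch(node_table: pd.DataFrame, interactome: pd.DataFrame):
    """Illustrative only: select the relevant nodes and format the edges.

    Following the guide's description, nodes are any node with a prize set,
    any node that is a source, or any node that is a target.
    """
    has_prize = node_table["prize"].notna()
    is_source = node_table["sources"].eq(True)
    is_target = node_table["targets"].eq(True)
    nodes = node_table.loc[has_prize | is_source | is_target, "NODEID"]

    # Hypothetical pipe-delimited edge format: "Interactor1|Interactor2"
    edges = interactome["Interactor1"] + "|" + interactome["Interactor2"]
    return list(nodes), list(edges)
```

With the data0 tables shown above, this would select nodes A and C (B has no prize and is neither a source nor a target) and produce the edges A|B and B|C.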
This makes sense to me! I think it would be a very helpful change.
Workflows are not automatically enabled and must be turned on manually.
It's that time of year again - use this GitHub issue to note any challenges you have when going through the Contributing Guide. Thanks!
It seems some things are out of date. Specifically file paths containing "/src/" should now contain "/spras/".
Also, I kept getting errors from Docker when trying to run the Snakefile. After some diagnostics, it looks like the problem is in the config.yaml file on line 17: declaring an owner in the registry does not allow for builds from multiple owners.
Some things I faced issues with for Local Neighborhood:
- In parse_output, need to reinsert directionality.
- In generate_inputs, get rid of prize in the node_df.
- In generate_inputs, set headers to False and remove directionality and weights.
- Set reedcompbio under container_registry after setting the rest of the algorithm includes to false.
- Commit f5b880bd7c396a4a099c46563b9a840a4bb8f4b3 in our header line pull request addresses the parse_output instructions.
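For the parse_output points above, here is a hedged sketch of what reading headerless algorithm output and restoring the header and a direction column might look like. The column names, the pipe delimiter, and the "U" undirected flag are assumptions for illustration, not the actual SPRAS conventions:

```python
import io

import pandas as pd

def parse_output_sketch(raw_output: str) -> pd.DataFrame:
    """Illustrative only: read headerless, pipe-delimited algorithm output
    and restore the columns a SPRAS-style output file would need."""
    df = pd.read_csv(io.StringIO(raw_output), sep="|", header=None,
                     names=["Node1", "Node2"])
    df["Rank"] = 1           # Local Neighborhood output is unranked
    df["Direction"] = "U"    # reinsert directionality: mark edges undirected
    return df

result = parse_output_sketch("A|B\nB|C\n")
print(result.to_string(index=False))
```

The key idea is that whatever the containerized algorithm strips out (headers, direction, weights) has to be added back here before the rest of the pipeline consumes the file.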
This issue is to collect challenges and confusion that new team members face as they follow the steps in the contribution guide so that we can improve the document.
One major issue is that "Step 3: Write the Local Neighborhood wrapper functions" is a huge leap from the previous steps. More scaffolding or smaller sub-steps are needed to build toward full implementations of generate_inputs, run_container, and parse_output. For instance, that could involve interactively inspecting the data structures available through the Dataset class and running pandas queries with them before implementing generate_inputs.