StanfordBioinformatics / pulsar_lims

A LIMS for ENCODE submitting labs.

Best practice for Data submission #118

Open twang15 opened 3 years ago

twang15 commented 3 years ago

Hi Daniel and Tao,

Thank you for bringing your extensive empirical knowledge to bear to drive a rich discussion of the challenges of managing research data in consortiums. Based on everything you discussed, I've seeded a document with recommended practices for data management. I expect this to be a living document that we continue to flesh out and refine as we continue our discussions, and that it will eventually become a general resource for managing data in consortium projects.

You should both have editing privileges and I encourage you to use them to comment on and add to the doc. We can discuss the first draft in the general group meeting once Daniel is settled (March 24) and revisit in the Friday meeting (March 26).

Best, Paul

https://docs.google.com/document/d/1np_DiA7NcXBhmMMxtK1-QY1KVC60qHVM0UGsBn3S4TY/edit#heading=h.yhjv011xls1l

twang15 commented 3 years ago

Also, I should say, I'm hoping we can also generate actionable technical solutions out of this. And who knows what else? The sky is the limit.

twang15 commented 3 years ago

Thanks Paul! Let’s investigate further and improve it together.

twang15 commented 3 years ago

Daniel:

Yes, thanks for getting us together. By the way, if you want to listen to an interview with that author I mentioned, here is a link to the podcast where I heard about him:

[The Art of Manliness] Email Is Making Us Miserable — Here's What to Do About It #theArtOfManliness https://podcastaddict.com/episode/119896723 via @PodcastAddict

twang15 commented 3 years ago

Currently, there is a conversation involving the sequencing center (John), the wet lab (Annika), SCG (Keith and team), and the data submission team (Tao).

  1. Since not everyone is on Slack, we use email to communicate. (This condition will be even harder to meet if an outside agency is involved, e.g., Chicago.)
  2. Dependencies: John needs space under Tao's folder on SCG to deliver data to Tao; Annika needs the data to write a grant; Tao needs the data to submit it; Tao needs Keith to investigate the problem; Keith needs an error message from John to diagnose the problem.
  3. The problem here is multilateral communication and collaboration over email. A tool that encodes the above dependency graph as it stands at any given time and shows it to everyone involved would be very helpful. It would also be convenient if messages could be initiated and sent by clicking buttons on the graph, and helpful if all communication logs could be extracted with timestamps into an issue tracker.
  4. The essence of such a tool is to maintain the dependency graph, facilitate communication (sending and receiving notifications), and keep logs.
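
The dependency graph in item 2 can be sketched in a few lines of Python. This is only an illustration, not part of Pulsar: the task names are hypothetical, and the link from Keith's diagnosis to the space on SCG is an assumption about how the pieces connect.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each maps to (owner, prerequisite tasks).
# The edge from "diagnose problem" to "free space on SCG" is an assumption.
TASKS = {
    "send error message": ("John", []),
    "diagnose problem": ("Keith", ["send error message"]),
    "free space on SCG": ("Keith", ["diagnose problem"]),
    "deliver data": ("John", ["free space on SCG"]),
    "write grant": ("Annika", ["deliver data"]),
    "submit data": ("Tao", ["deliver data"]),
}


def ready_tasks(done):
    """Return tasks that are not done but whose prerequisites all are."""
    return sorted(
        name
        for name, (_owner, needs) in TASKS.items()
        if name not in done and all(n in done for n in needs)
    )


def waiting_on(task, done):
    """Return (owner, task) pairs that still block `task`, directly or transitively."""
    blockers = []
    stack = list(TASKS[task][1])
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen or current in done:
            continue
        seen.add(current)
        blockers.append((TASKS[current][0], current))
        stack.extend(TASKS[current][1])
    return sorted(blockers)


def work_order():
    """Return one valid order in which the tasks can be completed."""
    graph = {name: set(needs) for name, (_owner, needs) in TASKS.items()}
    return list(TopologicalSorter(graph).static_order())
```

With nothing done yet, `ready_tasks(set())` returns only `send error message`, so in this model everything else is ultimately waiting on John. A GUI could render the same structure and attach a "notify owner" button to each blocker returned by `waiting_on`.
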
twang15 commented 3 years ago

One idea is to implement this tool in an event-based (for notifications) web framework with a GUI. For such a tool to succeed,

  1. everybody involved must be willing to use it;
  2. it must be easy to use (easier than writing an email).
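
The event-based part could look roughly like the publish/subscribe sketch below. It is a minimal in-process illustration, not a proposal for a specific framework: `EventBus`, the event name `data_delivered`, and the path are all hypothetical, and the `inbox` dictionary stands in for real email or Slack delivery.

```python
from collections import defaultdict
from datetime import datetime, timezone


class EventBus:
    """Tiny in-process publish/subscribe hub with a timestamped log."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # event name -> list of callbacks
        self.log = []  # (UTC timestamp, event name, payload) tuples

    def subscribe(self, event, callback):
        """Register `callback` to be called with the payload of each `event`."""
        self._subscribers[event].append(callback)

    def publish(self, event, payload):
        """Record the event in the log, then notify every subscriber."""
        self.log.append((datetime.now(timezone.utc), event, payload))
        for callback in self._subscribers[event]:
            callback(payload)


bus = EventBus()
inbox = defaultdict(list)  # stands in for email or Slack delivery

# Tao and Annika both want to know when the data arrives.
bus.subscribe("data_delivered", lambda p: inbox["Tao"].append(f"Data ready: {p['path']}"))
bus.subscribe("data_delivered", lambda p: inbox["Annika"].append(f"Data ready: {p['path']}"))

# John (or a script acting for him) announces the delivery once.
bus.publish("data_delivered", {"path": "/hypothetical/scg/path"})
```

One publish call reaches every interested person, and `bus.log` keeps the timestamped record that could later be exported to an issue tracker.
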
twang15 commented 3 years ago

Another submission issue is the "push or pull" workflow.

  1. The Snyder wet lab has its own in-house information management system, which I (data submission team) do not have access to. They maintain it in Google spreadsheets to keep track of samples, experiments, issues, etc.
  2. Part of the end product of this system is metadata submitted into Pulsar in small pieces.
  3. Pulsar is just one part of their system and workflow.
  4. For submission, I could either pull information from the wet lab, or the lab could push the information to me once it is ready.
twang15 commented 3 years ago

Complexity example: data quality issues introduce a dependency on humans, since we have to manually cherry-pick which datasets to submit. https://github.com/nathankw/pulsar_lims/issues/117

twang15 commented 3 years ago

Pulsar was designed for automatic data submission. But the wet lab team also needs to keep track of many things, and such a management view is missing!

twang15 commented 3 years ago

We need a notification system that signals when something is ready for the next stage. https://github.com/nathankw/pulsar_lims/issues/131 https://github.com/nathankw/pulsar_lims/issues/132

twang15 commented 3 years ago

Memo, 03/26/2021, Friday 11:00 am

  1. Connect GitHub with Outlook
  2. Crazy idea (Tao): if we have a system that is general enough, we can quickly configure it for other projects
  3. Daniel
    • data mart
    • use a spreadsheet as the interface, but extract the information into a database for more complex operations
  4. Tao: a DSL that takes specifications and generates controllers/models/views/APIs automatically
    • is there such a tool for Rails?
    • Use GitHub to manage tasks: noun (task categories) + depends (dependencies)
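
Daniel's spreadsheet-plus-database idea can be sketched with only the Python standard library. This is an illustration under assumptions: the sheet is taken to be exported as CSV, and the column names and rows below are made up rather than copied from the wet lab's actual spreadsheet.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; column names and rows are made up for illustration.
SHEET_CSV = """sample_id,experiment,status
S1,ChIP-seq,sequenced
S2,ChIP-seq,submitted
S3,single-cell,sequenced
"""


def load_sheet(csv_text, conn):
    """Copy spreadsheet rows into a `samples` table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(sample_id TEXT PRIMARY KEY, experiment TEXT, status TEXT)"
    )
    rows = [
        (r["sample_id"], r["experiment"], r["status"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # INSERT OR REPLACE makes re-importing an updated sheet idempotent.
    conn.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
load_sheet(SHEET_CSV, conn)

# One of the "more complex operations": count samples per experiment
# that have not been submitted yet.
pending = conn.execute(
    "SELECT experiment, COUNT(*) FROM samples WHERE status != 'submitted' "
    "GROUP BY experiment ORDER BY experiment"
).fetchall()
```

The wet lab keeps editing the spreadsheet as before, while queries and reports run against the database copy.
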
twang15 commented 3 years ago

Annika demonstrated a project management tool: Nirvana (https://nirvana.work). It is one of many project management tools.
Compared to GitHub, its strength is that it lets an individual project manager break tasks down at a fine grain based on expected work time, due date, etc., while GitHub is more of a collaborative tool for an entire team.

We could potentially build a tool to integrate the best of both worlds.

twang15 commented 3 years ago

Meeting memo: 04/09/2021, Annika, Paul, Tao, Daniel

  1. assign tasks to people and keep track of their responses
  2. automatically send out emails and organize them
  3. centralize the management across different agencies
  4. find a tool working for everyone across different labs, universities
  5. people are diverse and have very different backgrounds, e.g., in computer skills and access to computers
  6. A survey of tools for project management
    • tools
    • evaluation
    • An integration: Nirvana + Slack + GitHub + LIMS + spreadsheet (can we connect the spreadsheet to a backend database?) + some innovative tech
      • drop-down menu is good to restrict the proper input
  7. Challenges
    • people lack computer skills; even using the LIMS is difficult for them
    • GitHub and Slack are more difficult and less accessible (they assume access to a desktop computer, which is often not true for them)
    • many files, many elements to keep track of
      • the command line is not user-friendly: find, copy, download, etc.
    • An instant relief:
      • on-demand SCG
    • SCG-hosted ChIP/single-cell/etc. pipelines, e.g., https://truwl.com

TODOs

  1. SCG command-line cheat sheet (Paul, Daniel, Tao)
  2. More technical issues in the wet lab (Annika)
  3. Comparison of different tools (Nirvana, etc.)
twang15 commented 3 years ago

Related work:

  1. Automatic task extraction with NLP: https://hguo5.github.io/files/COLING-20-Lin.pdf