databio / bdshack19

Coordinating the 2019 Biomedical Data Science Hackathon at UVA

BDS Hackathon 2019


1. About

The Biomedical Data Sciences Training Program (bds_tg) is organizing its second annual hackathon on April 4-5, 2019. This will be a 48-hour event nucleated by the current trainees on the Biomedical Data Science NIH T32 training grant, and it will be open to other interested participants. Food will be provided to those who have RSVP'd, to fuel a collaborative and productive 48 hours.

Last year's hackathon analyzed the CITE-seq dataset (one of the first multi-omic single-cell datasets), and the participants presented their work at the annual UVA Datapalooza.

This year, we will apply the collective skill sets of all attendees to analyzing single-cell-resolution multi-omic data, with the aim of producing a Python package that automates aspects of analyzing this increasingly popular data type.

Skills required

No previous skills are required to attend, but any one of the following could be helpful:

Feel free to share this announcement with individuals you think might be interested in participating!

Expectations


2. Logistics

Date/Time

Thu Apr 4th 9am - Fri Apr 5th midnight

Location

Rice Hall (computer science building)

Presentation

After the dedicated 2-day hackathon concludes, we will aim to present the work to a broader audience (we are scheduling with CvilleBioHub for some time around May).

Chat and Social Media

2019 Organizers

All three will be in and out throughout the event to help and respond to questions both in person and remotely.

Make sure to thank Jason and Kim for making this possible, and feeding you!


3. Deliverable

We seek to develop a Python package to load, store, visualize, and analyze recently published single-cell multi-omic datasets (see data section below). We should take advantage of existing tools where available. Our Python package will need to consider and implement these elements:

Packaging structure

Making our functions installable and easily distributable would lead to convenient usage by others. Organizers can help here, when we get to this point!
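As a starting point, here is a minimal sketch of what an installable layout could look like. The package name `scmultiomic` and the module names are placeholders made up for illustration, not decisions that have been made; the `pyproject.toml` contents assume a reasonably recent `setuptools`.

```python
from pathlib import Path

PACKAGE_NAME = "scmultiomic"  # hypothetical name, pick your own

# Minimal build metadata; assumes setuptools >= 61 (pyproject-based metadata).
PYPROJECT = """\
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "{name}"
version = "0.0.1"
description = "Load, store, visualize, and analyze single-cell multi-omic data"
requires-python = ">=3.8"
"""


def scaffold(root, name=PACKAGE_NAME):
    """Create a bare package skeleton under `root` and list the files written."""
    root = Path(root)
    pkg = root / name
    pkg.mkdir(parents=True, exist_ok=True)
    (root / "pyproject.toml").write_text(PYPROJECT.format(name=name))
    (pkg / "__init__.py").write_text('__version__ = "0.0.1"\n')
    # One placeholder module per deliverable area (names are illustrative).
    for module in ("parsing", "plotting", "analysis"):
        (pkg / f"{module}.py").write_text(f'"""{module} helpers (placeholder)."""\n')
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
```

Running `scaffold("some_folder")` and then `pip install .` inside that folder should make `import scmultiomic` work, which is enough to start adding real functions.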

You can look at cookiecutter examples/templates to guide initial development:

Parsing

What data types will we work with?
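One common case is a plain count table. The sketch below assumes a CSV (optionally gzipped) with features in rows, cell barcodes in the header row, and integer counts; check the actual files before relying on that layout. The function name `read_count_table` is hypothetical, and it uses only the standard library so it runs anywhere.

```python
import csv
import gzip
import io


def _parse(handle):
    reader = csv.reader(handle)
    header = next(reader)
    cells = header[1:]            # first header field is the feature column
    features, counts = [], []
    for row in reader:
        if not row:               # skip blank lines
            continue
        features.append(row[0])
        counts.append([int(value) for value in row[1:]])
    return features, cells, counts


def read_count_table(source):
    """Read a features x cells count table from a path or an open text file."""
    if isinstance(source, str):
        opener = gzip.open if source.endswith(".gz") else open
        with opener(source, "rt", newline="") as handle:
            return _parse(handle)
    return _parse(source)


# Toy input standing in for a real file (values are made up).
demo = io.StringIO(",AAAC,AAAG\nCD3,5,0\nCD4,2,7\n")
features, cells, counts = read_count_table(demo)
```

For real dataset sizes, a general-purpose table library that you already know will likely be faster than plain lists; the point here is only the shape of the input and output.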

Python object

Thinking about input/output as well as manipulating the data (in RAM, on disk):
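To make the question concrete, here is one possible shape for such an object, offered as a sketch rather than a design decision. The class names `Modality` and `MultiOmicData` are hypothetical; the idea is that every assay shares one list of cell barcodes, so subsetting in RAM and round-tripping to disk stay consistent across assays.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Modality:
    features: list  # feature names, e.g. genes or antibodies
    counts: list    # counts[i][j] = value of feature i in cell j


@dataclass
class MultiOmicData:
    cells: list                                     # shared cell barcodes
    modalities: dict = field(default_factory=dict)  # name -> Modality

    def add_modality(self, name, features, counts):
        """Attach one assay, checking that it lines up with the shared cells."""
        if len(counts) != len(features):
            raise ValueError("need one row of counts per feature")
        if any(len(row) != len(self.cells) for row in counts):
            raise ValueError("every row needs one value per cell")
        self.modalities[name] = Modality(list(features), [list(r) for r in counts])

    def subset_cells(self, keep):
        """Return a new object restricted to the given barcodes (in RAM)."""
        keep = set(keep)
        idx = [j for j, cell in enumerate(self.cells) if cell in keep]
        out = MultiOmicData([self.cells[j] for j in idx])
        for name, mod in self.modalities.items():
            out.add_modality(name, mod.features, [[row[j] for j in idx] for row in mod.counts])
        return out

    def save(self, path):
        """Write everything to a single JSON file (on disk)."""
        with open(path, "w") as handle:
            json.dump(asdict(self), handle)

    @classmethod
    def load(cls, path):
        """Rebuild the object from a file written by `save`."""
        with open(path) as handle:
            raw = json.load(handle)
        obj = cls(raw["cells"])
        for name, mod in raw["modalities"].items():
            obj.add_modality(name, mod["features"], mod["counts"])
        return obj


# Toy example with made-up values.
data = MultiOmicData(cells=["AAAC", "AAAG", "AAAT"])
data.add_modality("rna", ["CD3E", "CD4"], [[5, 0, 2], [1, 7, 0]])
data.add_modality("protein", ["CD3"], [[40, 2, 11]])
small = data.subset_cells(["AAAC", "AAAT"])
```

Nested lists and JSON will not scale to full datasets. An existing sparse-matrix or annotated-matrix structure would probably replace them, but the interface questions (shared cell index, per-assay features, save/load) stay the same.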

Visualization

Visuals are a key part of human communication and information exchange (vision is our highest-bandwidth sensory modality, and figures are the backbone of scientific publications and presentations). A good visual at any step of the analysis should help summarize and/or communicate your ideas or findings.

How should the data be visualized?

Bonus: if combined with interactive tools, a "tactile" dimension can be added to the data visualization, enabling "hands-on" exploration

Machine learning

"Machine learning" is a general word that may mean different things to different people. Technically, it includes both supervised (useful to make predictions) and unsupervised (useful for exploring unknowns without prior assumptions) strategies, or combinations of these; and may range from simple algorithms like linear regression (fitting a line to some points) to fancier/newer algorithms like deep neural networks running on GPUs.

Before going down the rabbit hole of which tools to utilize/wrangle in a 2-day time window, start by considering:

Checking published papers or the existing documentation of various tools may provide inspiration for some of these questions.
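For a sense of scale at the simple end of that range, the linear regression mentioned above (fitting a line to some points) fits in a few lines of plain Python. This is an ordinary least-squares sketch; the function name `fit_line` and the toy numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy supervised example: predict one measurement from another.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predicted = slope * 4 + intercept
```

In a multi-omic setting, `xs` and `ys` could be two measurements of the same cells (for example an RNA count and a protein count), which makes this a quick sanity check before reaching for heavier models.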


4. Data

Option 1. pi-ATAC (ATAC + protein)

paper (Chen 2018): https://www.nature.com/articles/s41467-018-07115-y

Data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112091

Pros:

Cons:

Option 2. sci-CAR (RNA + ATAC)

paper (Cao 2018): http://science.sciencemag.org/content/361/6409/1380

Data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117089

Pros:

Cons:

associated scripts for processing sci-CAR

Option 3. CITE-seq (RNA + protein)

paper (Stoeckius 2017): https://www.nature.com/nmeth/journal/v14/n9/full/nmeth.4380.html

Data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100866

Pros:

Cons:

Option 4. Cell Hashing (RNA + protein + sample barcode)

paper (Stoeckius 2018): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1

Data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108313

Pros:

Cons:

included tool for decoding indexes from CITE-seq/Cell-hashing data:

Option 5. ECCITE-seq (RNA + protein + sample barcode + VDJ clonality + CRISPR sgRNA)

paper (Mimitou 2019): https://www.biorxiv.org/content/10.1101/466466v1

Data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126310

Pros:

Cons:


5. Resources

Computing

Throughout the event, please commit any code into this repository (the repo you're reading right now). Start new folders for code, or other work. And most importantly, remember to document your work and thought process, as if a colleague were reading it for the first time.

Most data analysis can be done on your own laptop (especially if you already have popular data science analysis software installed). Please bring a laptop (+ charger) if you're interested in participating!

For some specific tasks that require specialized environments and/or heavy lifting, access to Rivanna, the large-scale computing cluster at UVA, is useful. UVA affiliates who do not already have access can request it (see the instructions on the UVA ARCS website, as well as guides on how to use it and an intro UNIX tutorial).

Current T32 trainees on the BDSTG should already be members of the bds_tg group on Rivanna. Anybody in this group will already be able to access an additional allocation of permanent disk space on Rivanna at /sfs/lustre/allocations/bds_tg

We recently created a bdshackathon Rivanna group and added everyone who has RSVP'd to it. In theory, this should automatically grant access to Rivanna, with a temporary allocation of credit hours that should be sufficient to get started until you request your own. See:

Shared folder on Rivanna (for all hackathon attendees):

Tools

The tools for multi-omic single-cell-resolution analysis are all relatively new, with more appearing by the day.

Feel free to look for others, reference any you find useful, and mix & match to make the most of analyzing/integrating data. There may be overlap between tools, especially if they are "tools of tools" built on the same underlying algorithms or popular base libraries.

Depending on your goal, it may be quicker to wrangle tabular data using existing general-purpose data science libraries that you're already familiar with. In other cases, it will be absolutely necessary to leverage existing tools, instead of re-inventing the wheel on a complicated function. We encourage you to work together, divide-and-conquer, and switch back-and-forth accordingly to make the most of time available.

General Purpose tools

scRNAseq toolkits

Data Structures for large multi-feature + metadata handling

Tutorials

Feel free to add useful ones here as you find them.