ArcInstitute / ScreenPro2

Flexible analysis of high-content CRISPR screening
https://screenpro2.rtfd.io
Other
16 stars 1 forks source link

Question: library and sample format in cas9.py #45

Closed akshitasax closed 2 months ago

akshitasax commented 2 months ago

Hello!

I am interested in using ScreenPro2/ScreenPro to analyze reads from a dual-guide CRISPR screen. Specifically, I wanted to get counts from my reads files and map them to my dual-guide library. I understand that user guidelines for the 'Fastq to counts' pipeline have not been posted to the readme yet, so I wanted to try to run the functions in Cas9.py on a jupyter notebook. What is the format of the dual-guide library that is given as an input in the map_sample_counts_to_library(library, sample) function? And is 'sample' the df_count returned by fastq_to_count_dual_guide()?

Thank you, Akshita

abearab commented 2 months ago

Hello Akshita,

Thanks for your interest in using ScreenPro2, happy to respond to any questions!

I understand that user guidelines for the 'Fastq to counts' pipeline have not been posted to the readme yet, so I wanted to try to run the functions in Cas9.py on a jupyter notebook.

Sounds like a good plan! As part of #44, I'm actively working on more user friendly interface for ScreenPro2.

What is the format of the dual-guide library that is given as an input in the map_sample_counts_to_library(library, sample) function?

That's a good question! You need to have the have the library table and modify that based on your sequencing strategy. Can you tell me if what CRISPR library and sequencing protocol you are using? Is it from Replogle et al., eLife (2022) and in that case your library-prep protocol should be this, right? Once you clarify I can help you run the codes in jupyter.

And is 'sample' the df_count returned by fastq_to_count_dual_guide()?

That's true. This function is going to be changes in the next release. For the meantime, this is what I have done: I modified df_count to created a "sequence" column which would be matched with the "sequence" column in the library table. It depends on the sequencing strategy.


This is how I did fastq2count task for a dataset based on the Replogle et al., eLife (2022) protocol:

    df_count = cas9.fastq_to_count_dual_guide(
        R1_fastq_file_path=f'{fastq_dir}/{sample_id}_R1.fastq.gz',
        R2_fastq_file_path=f'{fastq_dir}/{sample_id}_R2.fastq.gz',
        trim5p_pos1_start=1,
        trim5p_pos1_length=19,
        trim5p_pos2_start=1,
        trim5p_pos2_length=19,
        verbose=verbose
    )
akshitasax commented 2 months ago

Thanks for your response!

Our library is a dual-gRNA library, each protospacer being 20 bp, which we cloned using this protocol from the Weissman lab https://weissman.wi.mit.edu/resources/2022_crispri_protocols/Protocol_1_dual_sgRNA_lib_cloning.pdf The library consists of 20 different guide pairs, without any unique indexing sequence for each pair.

We sequenced our samples via paired-end 250 NGS. And I already have trimmed R1 and R2 files that we processed using Geneious. So every read should already be trimmed to the 20 bp guide sequence that we are interested in counting. So, I was currently omitting the trimming arguments in the count function.

I was able to run the count function and obtain a df_count output. Let me know how the library table should be set up, what column names I should be using, etc. And let me know if any more information would help!

Thanks, Akshita

abearab commented 2 months ago

Thanks for your response!

Our library is a dual-gRNA library, each protospacer being 20 bp, which we cloned using this protocol from the Weissman lab https://weissman.wi.mit.edu/resources/2022_crispri_protocols/Protocol_1_dual_sgRNA_lib_cloning.pdf The library consists of 20 different guide pairs, without any unique indexing sequence for each pair.

We sequenced our samples via paired-end 250 NGS. And I already have trimmed R1 and R2 files that we processed using Geneious. So every read should already be trimmed to the 20 bp guide sequence that we are interested in counting. So, I was currently omitting the trimming arguments in the count function.

I was able to run the count function and obtain a df_count output. Let me know how the library table should be set up, what column names I should be using, etc. And let me know if any more information would help!

Thanks, Akshita

Okay, it make sense. If you already have your df_count, then that tells you the abundance for each sgRNA pairs. Now you need to map that to your library table. Your library table should have protospacer A/B sequence and you can concatenate them with a ; separator. I would name this new column "sequence" and make a similar column from df_count. See this example code:

    res = df_count.copy()
    res = res.sort('count', descending=True)
    res = res.with_columns(
        pl.concat_str(
            [
                pl.col("protospacer_a"),
                pl.col("protospacer_b")
            ],
            separator=";",
        ).alias("sequence"),
    )

    res_map = pl.DataFrame(library).join(
            res, on="sequence", how="left"
        )

** pl stands for polars


You may need to iterate at this step to make sure your "sequence" columns are defined properly (for instance, we don't sequence the first bp so we have 19bp in R1 / R2 instead of 20; yours can be a bit different). As a simple QC, you can do something like this

        print("% mapped reads",
            100 * \
            res_map.to_pandas()['count'].fillna(0).sum() / \
            int(res.select(pl.sum("count")).to_pandas()['count'])
        )

Best of luck.

abearab commented 2 months ago

@akshitasax, in the meantime, if you don't mind please ⭐ the repo ;)

abearab commented 2 months ago

47 discussed here.

abearab commented 2 months ago

@akshitasax – I think you get your answers here, right? Fell free to open new issue / reopen this / or email me if you have other questions.