Adjust alevin to use splici index and USA mode

allyhawkins commented 3 years ago

Closes #91. This PR changes some of the parameters in the alevin-fry and cellranger workflows being used for benchmarking.

For alevin-fry, I have done the following:

Combined the previous parameters for index_name and t2g to one index_type parameter to indicate if the workflow should be run using settings for the single-cell (cell) or single-nucleus (splici) samples. The default is set to cell and will then assign the index_name and corresponding tx2gene mapping file.
Added a parameter for the tx2gene_3col.tsv file needed to run Alevin-fry in USA mode. This can only be used if the index_name is set to splici and the resolution is cr-like. Otherwise, the 2 column tx2gene file is still used.
The unfiltered parameter has now been changed to params.filter = unfiltered as the default. If that is changed to knee then the --knee-distance flag is used to filter true barcodes during the generate-permit-list step. Based on our discussion of switching to only using the unfiltered version moving forward, I chose to set this as default and keep knee as an option just in case moving forward.

Originally, I was planning on using the splici index as the default index for all single-nuclei RNA-seq samples, but with the addition of testing the splici index on single-cell samples that became less straight forward. Instead, I included a map of index_names and tx2gene files based on the index_type (currently set as cell or splici) with the anticipation that if we do choose to use an index for single-cell vs. single-nuclei, we can use the seq_unit column of the metadata to set the index and tx2gene (similar to how the barcode list is assigned now).

For Cellranger, I set the default for single-nuclei samples to run with the pre-mRNA index using the --include-introns option, removing it as a parameter.

I also did think about adding in checks to make sure the input parameters were correct, specifically the params.filter and the params.resolution options, but didn't think that was necessary quite yet. Probably something we want to consider adding if we do choose alevin-fry as our permanent workflow.

allyhawkins commented 3 years ago

Thanks for all the helpful comments! I went back through and I believe I have addressed all of your suggestions and comments now. The main edits I did were the following:

I reorganized the workflow in general so all parameters are at the top, followed by any of the maps or lists that we use, then the processes, and lastly the workflow.
I moved the index and tx2gene paths out of paramaters and into the workflow. Additionally, I omitted the use of a ternary operator to build the tx2gene_3col_path and am now using a separate if/else statement (or what is now being called the t2g_quant_path - I'll explain why I still think we need both of these below).
I removed using the ternary operator for the choice in filtering options during generate-permit-list and now use a map within that process to assign the flag and barcode file if applicable.

I think the main thing I am trying to understand is how the index and resolution options interact (or don't). It seems like if we use splici, we will always be using cr-like resolution? Or am I misunderstanding that? We might also want to use cr-like with other tools, and perhaps that would become our new default...

In regards to your confusion between use of cr-like and splici together, the only way that we can use the USA mode with single-cell RNA sequencing (and subsequently use the functions written by alevin-fry developers) is to use those two conditions simultaneously. However, we have been using the splici index with the full resolution and using our own functions to read in spliced + intronic counts and combining them specifically in the context of the single-nuclei samples. It was my understanding that if we are still testing cr-like vs. the full resolution (for both snRNA and scRNA) and the use of splici vs. cDNA for the single-cell samples we would want to have the option of being able to use any combination of those two. Once we have narrowed down a default, then we could combine it into one argument if we feel like that is necessary. I'm not sure I understand what you mean by also using cr-like with other tools?

I'm a little confused by the logic here. If we only use t2g_3col with the cr-like & splici settings, do we need to set t2g_3col_path to the other t2g file in the other cases? Looking at this again, it seems like we could put all of this in t2g_path as we don't (by my quick look) use that at all.

So this isn't actually true. For the scenario that we are doing cr-like and splici, we need both the 2 column tx2gene file and the 3 column file. The 2 column file is being used for the first alevin process during alignment and no matter what the resolution or index, we will always need to use the 2 column file for the first step. The 3 column file is what is needed only in the context of cr-like and splici and only during the last alevin-fry quant step. Otherwise, that step requires the 2 column tx2gene mapping. To try and clear this up a little in the actual code, I renamed it to be the t2g_quant_path and it is only set to the 3 column file with cr-like and splici, otherwise it is equivalent to the t2g_path used in the alevin step. If you have other suggestions on how to deal with this I am all ears!

jashapiro commented 3 years ago

So this isn't actually true. For the scenario that we are doing cr-like and splici, we need both the 2 column tx2gene file and the 3 column file. The 2 column file is being used for the first alevin process during alignment and no matter what the resolution or index, we will always need to use the 2 column file for the first step. The 3 column file is what is needed only in the context of cr-like and splici and only during the last alevin-fry quant step. Otherwise, that step requires the 2 column tx2gene mapping. To try and clear this up a little in the actual code, I renamed it to be the t2g_quant_path and it is only set to the 3 column file with cr-like and splici, otherwise it is equivalent to the t2g_path used in the alevin step. If you have other suggestions on how to deal with this I am all ears!

Thanks for this explanation! That cleared it up in my head nicely, and the code changes I think help as well.

allyhawkins commented 3 years ago

I went back to read the example more carefully, and my understanding is that this should probably be conditional on the tx2gene file rather than the resolution mode alone? If the tx2gene file is three columns, that is the trigger for USA mode (I missed that point before), which then requires the mtx output so we can read the matrix (now 3x times longer) "manually" rather than with tximport/tximeta. Since the use of 3col is contingent on cr-like and splici, I would expect that to be condition here.

You are very right about the mtx output being dependent on the use of the 3 column tx2gene mapping, since that is how it designates USA mode or not. I went ahead and defined the use of that parameter within the same if statement for use of the 3 column file to address that. Although it does sound like we can get rid of this in the future hopefully as I would also prefer to use tximport.

AlexsLemonade / alsf-scpca

Adjust alevin to use splici index and USA mode #97