Closed allyhawkins closed 3 years ago
Thanks for all the helpful comments! I went back through and I believe I have addressed all of your suggestions and comments now. The main edits I did were the following:
tx2gene_3col_path
and am now using a separate if/else statement (or what is now being called the t2g_quant_path
- I'll explain why I still think we need both of these below). generate-permit-list
and now use a map within that process to assign the flag and barcode file if applicable. I think the main thing I am trying to understand is how the index and resolution options interact (or don't). It seems like if we use
splici
, we will always be usingcr-like
resolution? Or am I misunderstanding that? We might also want to usecr-like
with other tools, and perhaps that would become our new default...
In regards to your confusion between use of cr-like
and splici
together, the only way that we can use the USA mode with single-cell RNA sequencing (and subsequently use the functions written by alevin-fry developers) is to use those two conditions simultaneously. However, we have been using the splici
index with the full
resolution and using our own functions to read in spliced + intronic counts and combining them specifically in the context of the single-nuclei samples. It was my understanding that if we are still testing cr-like
vs. the full
resolution (for both snRNA and scRNA) and the use of splici
vs. cDNA
for the single-cell samples we would want to have the option of being able to use any combination of those two. Once we have narrowed down a default, then we could combine it into one argument if we feel like that is necessary. I'm not sure I understand what you mean by also using cr-like
with other tools?
I'm a little confused by the logic here. If we only use t2g_3col with the cr-like & splici settings, do we need to set t2g_3col_path to the other t2g file in the other cases? Looking at this again, it seems like we could put all of this in t2g_path as we don't (by my quick look) use that at all.
So this isn't actually true. For the scenario that we are doing cr-like
and splici
, we need both the 2 column tx2gene file and the 3 column file. The 2 column file is being used for the first alevin process during alignment and no matter what the resolution or index, we will always need to use the 2 column file for the first step. The 3 column file is what is needed only in the context of cr-like
and splici
and only during the last alevin-fry quant
step. Otherwise, that step requires the 2 column tx2gene mapping. To try and clear this up a little in the actual code, I renamed it to be the t2g_quant_path
and it is only set to the 3 column file with cr-like
and splici
, otherwise it is equivalent to the t2g_path used in the alevin step. If you have other suggestions on how to deal with this I am all ears!
So this isn't actually true. For the scenario that we are doing
cr-like
andsplici
, we need both the 2 column tx2gene file and the 3 column file. The 2 column file is being used for the first alevin process during alignment and no matter what the resolution or index, we will always need to use the 2 column file for the first step. The 3 column file is what is needed only in the context ofcr-like
andsplici
and only during the lastalevin-fry quant
step. Otherwise, that step requires the 2 column tx2gene mapping. To try and clear this up a little in the actual code, I renamed it to be thet2g_quant_path
and it is only set to the 3 column file withcr-like
andsplici
, otherwise it is equivalent to the t2g_path used in the alevin step. If you have other suggestions on how to deal with this I am all ears!
Thanks for this explanation! That cleared it up in my head nicely, and the code changes I think help as well.
I went back to read the example more carefully, and my understanding is that this should probably be conditional on the tx2gene file rather than the resolution mode alone? If the tx2gene file is three columns, that is the trigger for USA mode (I missed that point before), which then requires the mtx output so we can read the matrix (now 3x times longer) "manually" rather than with tximport/tximeta. Since the use of 3col is contingent on cr-like and splici, I would expect that to be condition here.
You are very right about the mtx output being dependent on the use of the 3 column tx2gene mapping, since that is how it designates USA mode or not. I went ahead and defined the use of that parameter within the same if statement for use of the 3 column file to address that. Although it does sound like we can get rid of this in the future hopefully as I would also prefer to use tximport.
Closes #91. This PR changes some of the parameters in the alevin-fry and cellranger workflows being used for benchmarking.
For alevin-fry, I have done the following:
index_name
andt2g
to oneindex_type
parameter to indicate if the workflow should be run using settings for the single-cell (cell
) or single-nucleus (splici
) samples. The default is set tocell
and will then assign theindex_name
and corresponding tx2gene mapping file.tx2gene_3col.tsv
file needed to run Alevin-fry in USA mode. This can only be used if theindex_name
is set tosplici
and the resolution iscr-like
. Otherwise, the 2 column tx2gene file is still used.params.filter = unfiltered
as the default. If that is changed toknee
then the--knee-distance
flag is used to filter true barcodes during thegenerate-permit-list step
. Based on our discussion of switching to only using the unfiltered version moving forward, I chose to set this as default and keepknee
as an option just in case moving forward.Originally, I was planning on using the splici index as the default index for all single-nuclei RNA-seq samples, but with the addition of testing the splici index on single-cell samples that became less straight forward. Instead, I included a map of index_names and tx2gene files based on the index_type (currently set as
cell
orsplici
) with the anticipation that if we do choose to use an index for single-cell vs. single-nuclei, we can use theseq_unit
column of the metadata to set the index and tx2gene (similar to how the barcode list is assigned now).For Cellranger, I set the default for single-nuclei samples to run with the pre-mRNA index using the
--include-introns
option, removing it as a parameter.I also did think about adding in checks to make sure the input parameters were correct, specifically the
params.filter
and theparams.resolution
options, but didn't think that was necessary quite yet. Probably something we want to consider adding if we do choose alevin-fry as our permanent workflow.