peeratant commented 5 years ago

According to the code below, the normalize_data function seems to remove every row whose total count is zero.

If I want to merge the computation result back into the original data, I have to follow these steps:

1. Loop over each row of the original data.
2. Check whether the current row has a total count of 0 or greater; if total_count > 0, then replace(original_data_row[o_index], imputed_data_row[i_index]).

Although this approach may yield a satisfactory result, its performance will suffer greatly, which really matters for big datasets.

Do you have any suggestions for improving the performance in this case?

```r
normalize_data <- function (A) {
    # Simple convenience function to library and log normalize a matrix

    totalUMIPerCell <- rowSums(A);
    if (any(totalUMIPerCell == 0)) {
        toRemove <- which(totalUMIPerCell == 0)
        A <- A[-toRemove,]
        totalUMIPerCell <- totalUMIPerCell[-toRemove]
        cat(sprintf("Removed %d cells which did not express any genes\n", length(toRemove)))
    }

    A_norm <- sweep(A, 1, totalUMIPerCell, '/');
    A_norm <- A_norm * 10E3
    A_norm <- log(A_norm +1);
}
```

linqiaozhi commented 5 years ago

Not sure I fully understand your question. Are you saying the normalize_data() function is too slow for large datasets?

Or are you saying that after you run normalize_data() and you want to merge the resulting matrix back into the original matrix, you do this via a loop that is very slow?

peeratant commented 5 years ago

What I mean is that when normalizing with this function, every gene that has a count of 0 in every cell is removed by the `if (any(totalUMIPerCell == 0))` condition.

For example, I use a data set with 20,000 genes, but only 3,000 genes are expressed in at least one cell. After applying the provided normalization function, 17,000 genes are removed.

I would like to know how to efficiently merge the imputation result for the 3,000 genes back into the original data (20,000 genes, including those with zero counts).

linqiaozhi commented 5 years ago

Got it.

Just run imputation on the subset of rows that are non-zero and then set the appropriate rows of the larger matrix to the result of imputation on the subset.

For example, something like this:

# Make a matrix that has a row of all zeros
A <- matrix(1,nrow=10,ncol = 5)
A[2,] <- 0 
print(A)

totalUMIPerCell <- rowSums(A)
toKeep <- which(totalUMIPerCell >0)
A_subset <- A[toKeep, ]

# Run imputation on A_subset. Here I'll just add the number 1, just for demo
A_subset <- A_subset + 1

A[toKeep, ]  <- A_subset
print(A)
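
The same pattern could be wrapped in a small helper so the subsetting and write-back happen in one place, with no explicit loop over rows. This is only a sketch under stated assumptions: `impute_nonzero_rows` and `impute_fn` are hypothetical names that are not part of ALRA, and the imputation step is assumed to return a matrix with the same dimensions as its input.

```r
# Hypothetical helper (not part of ALRA): apply `impute_fn` only to the rows
# with a non-zero total and write the result back into a copy of the matrix.
impute_nonzero_rows <- function(A, impute_fn) {
  toKeep <- which(rowSums(A) > 0)
  if (length(toKeep) == 0) {
    return(A)  # nothing to impute, return the input unchanged
  }

  # drop = FALSE keeps a matrix even when only one row survives
  A_subset <- A[toKeep, , drop = FALSE]
  A_imputed <- impute_fn(A_subset)

  # Assumption: impute_fn preserves the dimensions of its input
  stopifnot(identical(dim(A_imputed), dim(A_subset)))

  result <- A
  result[toKeep, ] <- A_imputed  # vectorized assignment, no per-row loop
  result
}

# Toy example: 10 x 5 matrix of ones with row 2 set to all zeros
A <- matrix(1, nrow = 10, ncol = 5)
A[2, ] <- 0

# Placeholder for the real imputation step: add 1 to every entry
A_full <- impute_nonzero_rows(A, function(X) X + 1)
print(A_full)
```

Because the all-zero rows are never passed to `impute_fn`, they stay zero in `A_full`, and the output keeps the row order and dimensions of the original matrix.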

peeratant commented 5 years ago

That's helpful, thanks.

KlugerLab / ALRA

Provided normalized function #4
