APAF-bioinformatics / ProteomeScholaR

GNU Lesser General Public License v3.0

Potential issue with chooseBestProteinAccession #56

Closed Bucket-Chemist closed 2 days ago

Bucket-Chemist commented 4 days ago

This skews the data for a cleaned accession when there are duplicates.

The duplicate values get summed, so you end up with large values relative to the rest of the dataset.

    summed_data <- protein_log2_quant_cln |>
      # keep only the first accession in each delimited group
      mutate(
        !!sym(protein_id_column) := purrr::map_chr(
          !!sym(protein_id_column)
          , \(x) { str_split(x, delim)[[1]][1] }
        )
      ) |>
      pivot_longer(
        cols = !matches(protein_id_column)
        , names_to = "sample_id"
        , values_to = "temporary_values_choose_accession"
      ) |>
      group_by(!!sym(protein_id_column), sample_id) |>
      summarise(
        is_na = sum(is.na(temporary_values_choose_accession))
        # duplicates of the same accession are summed here
        , temporary_values_choose_accession = sum(temporary_values_choose_accession, na.rm = TRUE)
        , num_values = n()
      ) |>
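A short note on why the sum inflates values, assuming the quantities in protein_log2_quant_cln are log2 intensities (as the name suggests). The symbols I_1 and I_2 are hypothetical linear-scale intensities of two duplicate rows, introduced only for illustration:

```latex
\log_2 I_1 + \log_2 I_2 = \log_2 (I_1 I_2),
\qquad
\tfrac{1}{2}\left(\log_2 I_1 + \log_2 I_2\right) = \log_2 \sqrt{I_1 I_2}
```

Under that assumption, summing on the log2 scale multiplies the underlying intensities, while the mean on the log2 scale corresponds to their geometric mean.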

Suggest change to mean?

, temporary_values_choose_accession = mean(temporary_values_choose_accession, na.rm=TRUE)

or transform back to linear, take mean and log2 again?
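To make the two suggestions concrete, here is a minimal base R sketch with made-up numbers. The vector log2_values is hypothetical and stands in for two duplicate rows of one accession in one sample; it is not taken from the dataset in this issue.

```r
# Hypothetical log2 intensities for two rows that collapse to the same accession.
log2_values <- c(20, 21)

# Current behaviour: summing on the log2 scale.
summed <- sum(log2_values, na.rm = TRUE)

# Suggestion 1: mean on the log2 scale.
mean_log2 <- mean(log2_values, na.rm = TRUE)

# Suggestion 2: back-transform to linear, take the mean, then log2 again.
mean_linear <- log2(mean(2^log2_values, na.rm = TRUE))

print(c(summed = summed, mean_log2 = mean_log2, mean_linear = mean_linear))
```

With these inputs the sum is 41, the log2-scale mean is 20.5, and the linear-scale mean is 20 + log2(1.5), roughly 20.58. The two mean variants stay in the range of the inputs; the sum does not.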

Examples of the downstream issues are below:

image

image

Bucket-Chemist commented 4 days ago

Cloned the function and tested it locally with the change to mean().

The data looks a lot less funky.

image

image

IgnatiusPang commented 4 days ago

I wonder if we can have an option to choose between mean and sum, with mean as the default?

IgnatiusPang commented 4 days ago

Yeah, for some reason when we run chooseBestProteinAccession on that dataset, it either sums them all up to massive values or, if swapped to mean, introduces a PC2 of 23596% (when you run PCA on the cyclic loess-transformed data).

Bucket-Chemist commented 2 days ago

Fixed: implemented a choice of mean, median, or sum.
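For readers who want a picture of what such an option could look like, below is a rough base R sketch. It is not the code that was committed: the function name aggregate_duplicates and its method argument are hypothetical, and the all-NA guard is an addition of this sketch.

```r
# Hypothetical helper, not the package's actual implementation.
aggregate_duplicates <- function(values, method = c("mean", "median", "sum")) {
  # match.arg() picks "mean" when no method is given and errors on unknown methods.
  method <- match.arg(method)

  # Guard added for this sketch: with na.rm = TRUE an all-NA group would
  # otherwise give NaN (mean), NA (median), or 0 (sum).
  if (all(is.na(values))) {
    return(NA_real_)
  }

  switch(method,
    mean = mean(values, na.rm = TRUE),
    median = median(values, na.rm = TRUE),
    sum = sum(values, na.rm = TRUE)
  )
}

print(aggregate_duplicates(c(20, 21, NA)))            # default: mean
print(aggregate_duplicates(c(20, 21, 25), "median"))
print(aggregate_duplicates(c(20, 21), "sum"))
```

A helper of this shape could replace the hard-coded sum() inside summarise(), with the method passed down from the caller.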