shorten rownames - Githubissues

MHH-RCUG / haybaler

Haybaler: Collate/integrate reporting CSV or bam.txt files from the Wochenende pipeline https://github.com/MHH-RCUG/Wochenende

MIT License


shorten rownames #73

Open LisaHollstein opened 2 years ago

LisaHollstein commented 2 years ago

The row names in the output CSVs, as well as the bacteria names in the heatmaps, are very long, e.g.:

1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228BAC
1_CP007601_1_Staphylococcus_capitis_subspcapitis_strain_AYP1020BAC
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNABAC

They could be shortened to something like:

Staphylococcus epidermidis
Staphylococcus capitis subsp capitis
Rothia mucilaginosa

LisaHollstein commented 2 years ago

@colindaven

I wrote some code that does the following:

extract the species name (and subspecies) from the long name
rename the rows in the table
save as csv (same name as before, but with the extension "short")
save a table mapping each initial long name to its new short name
if a species occurs multiple times in the table (e.g. multiple chromosomes, with one row per chromosome), the values of those rows are summed (and a warning is issued that this has been done)
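
The name-extraction and mapping steps listed above could look roughly like the following. This is only a sketch, not the code from the PR: the function names `shorten_name` and `build_name_table` are hypothetical, and the heuristic (the first capitalized, purely alphabetic token is taken as the genus, the next token as the species, followed by an optional token starting with "subsp") is an assumption based on the three example names in this issue. Words are joined with underscores here.

```python
import re

# Assumption: a genus token is one capital letter followed only by lowercase
# letters, which excludes accession tokens such as "AE015929" or "CP007601".
GENUS = re.compile(r"^[A-Z][a-z]+$")


def shorten_name(long_name):
    """Return 'Genus_species' (plus an optional 'subsp...' token)."""
    tokens = long_name.split("_")
    for i, token in enumerate(tokens[:-1]):
        if GENUS.match(token):
            parts = [token, tokens[i + 1]]
            # Keep a subspecies token if it directly follows the species.
            if i + 2 < len(tokens) and tokens[i + 2].startswith("subsp"):
                parts.append(tokens[i + 2])
            return "_".join(parts)
    # No genus-like token found: leave the name unchanged.
    return long_name


def build_name_table(long_names):
    """Pair every long name with its short name, keeping the input order."""
    return [(name, shorten_name(name)) for name in long_names]


if __name__ == "__main__":
    examples = [
        "1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228BAC",
        "1_CP007601_1_Staphylococcus_capitis_subspcapitis_strain_AYP1020BAC",
        "1_AP011540_1_Rothia_mucilaginosa_DY_18_DNABAC",
    ]
    for long_name, short_name in build_name_table(examples):
        print(long_name, "->", short_name)
```

Names that do not match the heuristic are returned unchanged, so unusual reference names would need to be checked by hand.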

colindaven commented 2 years ago

OK, good. Subspecies should hopefully not be present too often
Please use an underscore "_" between words; that makes it easier to code for in different languages
The extra table with "_short.csv" as extension sounds good
Summing the values in the table: this is only appropriate for non-normalized data like read counts. Normalized data like bacteria per human cell should be averaged (the median would be better, but we mostly have only two data points).
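
To illustrate the difference between summing and averaging duplicate rows, here is a small standard-library sketch. The function name `collapse_rows` and the row layout (a short name plus a list of per-sample values) are assumptions made for this example, not haybaler's actual data structures.

```python
from collections import defaultdict
from statistics import mean, median

# Which aggregation to use is chosen per table: "sum" for raw read counts,
# "mean" or "median" for normalized values.
AGGREGATORS = {"sum": sum, "mean": mean, "median": median}


def collapse_rows(rows, how):
    """Merge rows that share a short name, column by column.

    rows: iterable of (short_name, [value_per_sample, ...]) pairs.
    how:  one of "sum", "mean", "median".
    """
    aggregate = AGGREGATORS[how]
    grouped = defaultdict(list)
    for name, values in rows:
        grouped[name].append(values)
    # zip(*row_list) walks the sample columns of all rows of one species.
    return {
        name: [aggregate(column) for column in zip(*row_list)]
        for name, row_list in grouped.items()
    }


if __name__ == "__main__":
    # Made-up numbers: one species with two chromosomes, one with a single row.
    rows = [
        ("Genus_species_a", [10, 0]),
        ("Genus_species_a", [30, 4]),
        ("Genus_species_b", [5, 1]),
    ]
    print(collapse_rows(rows, "sum"))   # suitable for read counts
    print(collapse_rows(rows, "mean"))  # suitable for normalized tables
```

With only two rows per species, the mean and the median give the same result, which matches the remark about mostly having two data points.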

Thanks for this, I'll look forward to the implementation and PR

LisaHollstein commented 2 years ago

Okay, the rows (usually) aren't summed anymore.

In haybaler.py the "_short.csv" is only created if there aren't multiple rows for one species. Only for the read count table the rows are still summed and a "_short.csv" is created.

In shorten_names.R the rows just keep their old, long names

LisaHollstein commented 2 years ago

I just noticed a problem:

The read_count_short table is in a different order than the normal read_count table. I think the easiest solution is to not output any csv with short names at all if the short names aren't unique.
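
A minimal sketch of that rule, assuming the renaming is done row by row so that the original row order is kept. The names `duplicated_names` and `rename_rows_if_unique` are hypothetical, and `shorten` stands for whatever function maps a long name to a short one.

```python
from collections import Counter


def duplicated_names(short_names):
    """Return the short names that occur more than once."""
    counts = Counter(short_names)
    return [name for name, count in counts.items() if count > 1]


def rename_rows_if_unique(long_names, shorten):
    """Return the short names in the original row order, or None.

    None signals that at least one short name is not unique, in which
    case no "_short.csv" should be written for this table.
    """
    short_names = [shorten(name) for name in long_names]
    if duplicated_names(short_names):
        return None
    return short_names
```

Because nothing is grouped or summed in this variant, the short table has the same rows in the same order as the long one.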

colindaven commented 2 years ago

It would be nice to have a test for this problem too. Even just a line count of the two datasets, and if they're not the same output an error.
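
Such a line-count test could be sketched like this. The function names are hypothetical, and the file names in the usage comment are taken from the `wc -l` listing in this issue.

```python
def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_same_line_count(long_csv, short_csv):
    """Raise ValueError if the two CSV files differ in their line count."""
    n_long = count_lines(long_csv)
    n_short = count_lines(short_csv)
    if n_long != n_short:
        raise ValueError(
            f"{short_csv} has {n_short} lines, but {long_csv} has {n_long}"
        )
    return n_long


# Example call (paths relative to the haybaler output directory):
# check_same_line_count("read_count_haybaler.csv",
#                       "read_count_haybaler_short.csv")
```

A matching line count does not prove that the row order is identical, so this only catches the case where rows were merged or dropped.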

Short names look good to me so far.

species                               chr_length  gc_ref  Umwelt2_1_S20_R1  Umwelt2_2_S21_R1  Umwelt2_3_S93_R1  Umwelt2_4_S94_R1  Umwelt2_5_S95_R1
Moraxella_osloensis                   2434688.0   43.85   34502.64          153977.36         0.0               0.0               1293.59
Paracoccus_yeei                       3622127.0   67.18   8502.91           26258.12          0.0               0.0               0.0
Cutibacterium_acnes                   2522438.0   59.99   16523.65          13796.62          0.0               5061.46           11111.11

haybaler/control_dataset/haybaler_output$ wc -l *.csv
   160 bacteria_per_human_cell_haybaler.csv
   160 bacteria_per_human_cell_haybaler_short.csv
   154 excluded_taxa.csv
   160 read_count_haybaler.csv
   160 read_count_haybaler_short.csv
   160 reads_per_million_reads_in_experiment_haybaler.csv
   160 reads_per_million_reads_in_experiment_haybaler_short.csv
   160 reads_per_million_ref_bases_haybaler.csv
   160 reads_per_million_ref_bases_haybaler_short.csv
   160 RPMM_haybaler.csv
   160 RPMM_haybaler_short.csv