Open LisaHollstein opened 2 years ago
@colindaven
I wrote some code that does the following:
Thanks for this, I'll look forward to the implementation and PR
Okay, the rows (usually) aren't summed anymore.
In haybaler.py, the "_short.csv" is only created if there aren't multiple rows for one species. Only for the read count table the rows are still summed and a "_short.csv" is created.
In shorten_names.R, the rows just keep their old, long names.
I just noticed a problem:
The read_count_short table is in a different order than the normal read_count table. I think the easiest way is to not output any csv with short names at all if the short names aren't unique.
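A minimal Python sketch of that uniqueness check, assuming the short names are available as a plain list of strings. The function names `duplicated_short_names` and `should_write_short_csv` are hypothetical and are not taken from haybaler.py:

```python
from collections import Counter


def duplicated_short_names(short_names):
    """Return every short name that occurs more than once, sorted."""
    counts = Counter(short_names)
    return sorted(name for name, n in counts.items() if n > 1)


def should_write_short_csv(short_names):
    """Hypothetical guard: only write a "_short.csv" if all short names are unique."""
    return not duplicated_short_names(short_names)


if __name__ == "__main__":
    names = ["Moraxella_osloensis", "Paracoccus_yeei", "Moraxella_osloensis"]
    print(duplicated_short_names(names))
    print(should_write_short_csv(names))
```

Returning the duplicated names, rather than only a yes/no answer, would make it easy to log which species caused the short table to be skipped.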
It would be nice to have a test for this problem too. Even just a line count of the two datasets, and if they're not the same output an error.
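Such a line count test could look roughly like this in Python. It is only a sketch: the helper names `count_lines` and `check_same_line_count` are made up for illustration, and the file paths in the usage lines are just the ones from the control dataset output.

```python
def count_lines(path):
    """Count the lines in a text file, header included."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_same_line_count(normal_csv, short_csv):
    """Raise ValueError if the two files differ in line count.

    Returns the shared line count otherwise.
    """
    n_normal = count_lines(normal_csv)
    n_short = count_lines(short_csv)
    if n_normal != n_short:
        raise ValueError(
            f"{normal_csv} has {n_normal} lines but {short_csv} has {n_short}"
        )
    return n_normal


# Example usage (paths assumed to exist in the current directory):
# check_same_line_count("read_count_haybaler.csv", "read_count_haybaler_short.csv")
```

A line count catches rows that were summed or dropped, but it would not catch a changed row order, so the order problem would need a separate comparison.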
Short names look good to me so far.
species chr_length gc_ref Umwelt2_1_S20_R1 Umwelt2_2_S21_R1 Umwelt2_3_S93_R1 Umwelt2_4_S94_R1 Umwelt2_5_S95_R1
Moraxella_osloensis 2434688.0 43.85 34502.64 153977.36 0.0 0.0 1293.59
Paracoccus_yeei 3622127.0 67.18 8502.91 26258.12 0.0 0.0 0.0
Cutibacterium_acnes 2522438.0 59.99 16523.65 13796.62 0.0 5061.46 11111.11
haybaler/control_dataset/haybaler_output$ wc -l *.csv
160 bacteria_per_human_cell_haybaler.csv
160 bacteria_per_human_cell_haybaler_short.csv
154 excluded_taxa.csv
160 read_count_haybaler.csv
160 read_count_haybaler_short.csv
160 reads_per_million_reads_in_experiment_haybaler.csv
160 reads_per_million_reads_in_experiment_haybaler_short.csv
160 reads_per_million_ref_bases_haybaler.csv
160 reads_per_million_ref_bases_haybaler_short.csv
160 RPMM_haybaler.csv
160 RPMM_haybaler_short.csv
The row names in the output CSVs, as well as the bacteria names in the heatmaps, are very long, e.g.
1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228BAC
1_CP007601_1_Staphylococcus_capitis_subspcapitis_strain_AYP1020BAC
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNABAC
they could be shortened to something like
Staphylococcus epidermidis
Staphylococcus capitis subsp capitis
Rothia mucilaginosa
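One possible way to do this, sketched in Python. It assumes every long name starts with a prefix shaped like `1_AE015929_1_` (number, accession, number) and that the first two words after it are genus and species. Both are assumptions based only on the three examples above, and `shorten_name` is a hypothetical helper, not the code in shorten_names.R. This sketch also drops the subspecies, so the second example comes out without "subsp capitis".

```python
import re

# Assumed prefix layout: "<number>_<accession>_<number>_", e.g. "1_AE015929_1_".
PREFIX = re.compile(r"^\d+_[A-Z]+\d+_\d+_")


def shorten_name(long_name):
    """Strip the assumed accession prefix and keep only genus and species."""
    rest = PREFIX.sub("", long_name)
    tokens = rest.split("_")
    # Keep the first two tokens (genus, species); strain details are dropped.
    return "_".join(tokens[:2])


if __name__ == "__main__":
    examples = [
        "1_AE015929_1_Staphylococcus_epidermidis_ATCC_12228BAC",
        "1_CP007601_1_Staphylococcus_capitis_subspcapitis_strain_AYP1020BAC",
        "1_AP011540_1_Rothia_mucilaginosa_DY_18_DNABAC",
    ]
    for name in examples:
        print(shorten_name(name))
```

Because different strains of one species collapse to the same short name here, the result may not be unique across rows.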