BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0
21 stars 12 forks source link

Avoid runtime error during merge, no headers in table #113

Closed smoe closed 2 years ago

smoe commented 2 years ago

This is the PR to address what likely is an incompatibility with recent versions of R as found the hard way in Issue #111 .

rekado commented 2 years ago

Thank you for the patch! I wonder: could we add a test to ensure we don't run into similar issues in the future?

borauyar commented 2 years ago

@smoe I think I wasn't careful when reviewing this PR. The merging procedure has to assume that there aren't duplicate column names. The changes you propose actually breaks the pipeline :) The count files for each sample has to consist of two columns: 'V1', and unique sample name. The sample names must be unique for each count file, so when merging multiple data.frames by V1, it should create a table with unique column names that correspond to the list of samples found in the sample sheet. This should be compatible with the colData file which is important because the colData file and the merged counts table are used for downstream analysis. This means the counts file must contain the header, which should consist of sample names from the sample sheet.

In your proposed solution, we lose all sample names, so it breaks the pipeline. Could you send me a use case so that I can look into the problem? I would need the sample sheet. Also, it is important to see how the individual count files look in your pipeline output. Could you show some examples e.g. from mapped_reads/hisat2/<sampe name>.read_counts.tsv?

smoe commented 2 years ago

@borauyar I addressed two problems in one with this patch. The first was that there were already not sample names for the input of merge and hence the mapping of too many v1.x and v1.y onto each other. Maybe that is already a bug that surfaced on my end for some reason. And yes, this is also a problem of the merge routine which should not rename onto something that is already existing but instead come up with v1.y.y or v1.y2 or whatever.

Or is this about the write.table after the merge? All the column names are non-informative but the order prevails. The default write.table had the v1... written as headers but left no empty header over the row.names which then irritated the downstream routine about the first row having fewer columns than the others.

My hunch was that some default attributes have changed with the advent of R 4.1 and that this was why there are differences. But for that write.table ... I would admittedly have thought that my fix would be R-version independent and was wondering a bit that it just all worked with the new guix installation. And I am surprised even more since the guix install pigx-rnaseq apparently also works with a 4.1 version of R.

Puzzled.

borauyar commented 2 years ago

The files that are merged should look like the following:

When you read each file using data.table::fread it should be a table of 2 columns, where first column name will be V1 and the second column name will the name of the sample, for which the counting was executed.

File -1

,UHR_Rep1
ENSG00000056487,15
ENSG00000100347,0
ENSG00000138944,158
ENSG00000138964,17

File -2

,UHR_Rep2
ENSG00000056487,16
ENSG00000100347,0
ENSG00000138944,107
ENSG00000138964,18

This code below will read each file introducing a new column V1 because it is a data.table and keep the second column that has the sample name.

counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})

and the merging code will merge by V1 as it is supposed to be common for all the tables read in.

If the per sample count files don't look like above examples, then we could look for the error where the sample-specific count files are saved.

These are for hisat2/star based counts only. If you have issues for Salmon-based counts, then we need to look into the script counts_matrix_from_SALMON.R where all transcript/gene level files are read and combined using sample names extracted from colData file.

borauyar commented 2 years ago

The issue could also be about which library's merge function is used at run time. To avoid this ambiguity, we could replace merge with data.table::merge.data.table in:

counts_all <- as.data.frame(Reduce(function(dtf1, dtf2)
  data.table::merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE),
       counts))
smoe commented 2 years ago

@borauyar , @rekado , I am afraid we need to revisit this one. The problem is that the column names are not set when the data files are read:

> counts <- lapply(count_files, function(f) {
  data.table::fread(f)
})
> length(counts)
[1] 24
> names(counts)
NULL
> head(counts[[1]])
                   V1     V2
1:                    158320
2: ENSOCUG00000000004   2259
3: ENSOCUG00000000005   1919
4: ENSOCUG00000000006   1081
5: ENSOCUG00000000007     36
6: ENSOCUG00000000008      9

and the reason is that the name of that sample is a number that the defeault header="auto" does not recognize as a header:

$ head .../output/mapped_reads/hisat2/158368.read_counts.csv
,158368
ENSOCUG00000000004,2464
ENSOCUG00000000005,1517
ENSOCUG00000000006,382
ENSOCUG00000000007,13
ENSOCUG00000000008,1
ENSOCUG00000000009,6327
ENSOCUG00000000010,6505
ENSOCUG00000000012,6659
ENSOCUG00000000013,294

I do not have an exact overview from where this routine is called, but since the merge fails if headers were not read, I find the current implementation to miss being explicit here. Please:

counts <- lapply(count_files, function(f) {
-  data.table::fread(f)
+  data.table::fread(f,header=TRUE)
})

Above change fixed it for me.

Sidenote: These numerical sample names were not my idea - the sequencing center came up with them.