McGranahanLab / TcellExTRECT

Other
45 stars 20 forks source link

Index Not Found But It Exists #6

Closed DarioS closed 2 years ago

DarioS commented 2 years ago

https://github.com/McGranahanLab/TcellExTRECT/blob/c155d9fd1082621aae8eedaa2a15f4cb5900c524/R/getCovFromBam.R#L23-L25

The pipeline we use creates file names such as:

-rw-rw---- 1 cew562 hm82 242G Feb 24  2020 OSCC_1-P.final.bam
-rw-rw---- 1 cew562 hm82 9.6M Feb 24  2020 OSCC_1-P.final.bai

TcellExTRECT is the first bioinformatics tool in 18 months that hasn't worked with them. The corrected code is

if(!((file.exists(bai.path) || file.exists(bai.path2))))

You are missing a pair of round brackets which makes the logical evaluation order incorrect. Also, error message should use "proceeding" instead of "preceding".

https://github.com/McGranahanLab/TcellExTRECT/blob/c155d9fd1082621aae8eedaa2a15f4cb5900c524/R/getCovFromBam.R#L26

rbentham commented 2 years ago

Hi Dario,

Thanks for your attention to these issues, please understand that T cell ExTRECT is a very new tool and has only been public for a week, we are working hard to eliminate any bugs in the existing code and appreciate your input.

I am a bit confused by this as I can not replicate your issue, see below this quick dummy example using identical named files to those listed above, which only fails upon use of samtools as the files are empty and fails in the correct way when the dummy bai file is removed.

image

I also tested your suggestion for the corrected code and it seems equivalent to the code currently in the function. bai.path and bai.path2 should both be a single value so the use of || vs | is equivalent, and as I understand it having double brackets around everything does not change the logical evaluation e.g in the situation where the index file OSCC_1-P.final.bai exists but not OSCC_1-P.final.bam.bai your suggestion leads to !(( FALSE || TRUE )) which is identical to the current implementation !(FALSE | TRUE)

I was wondering if this issue is from trying to run the function on multiple bam files at once? Right now it is not vectorised to do that. I have added a test at the beginning to stop the function if more than one bam file is supplied.

DarioS commented 2 years ago

No, just one BAM file. I was in a hurry to go to a meeting and now I notice that your brackets do wrap around both the first and second conditions correctly.

I found the real source of the problem. Our pipeline outputs BAM files to a directory called Final_bams and I provide the absolute path to a BAM file. When the package code bai.path2 <- paste0(gsub('.bam','',bamPath),'.bai') runs, it coverts the absolute path /scratch/hm82/OSCC/Final_bams/OSCC_1-P.final.bam.bai into /scratch/hm82/OSCC/Finals/OSCC_1-P.final.bai. Note how the directory name was undesirably changed as well as the intended file name suffix. It would be safer to use gsub("bam$", "bai", bamPath). The dollars sign ensures it only matches the end of a file name. With my tweaked pattern:

> bamPath
[1] "/scratch/hm82/OSCC/Final_bams/OSCC_1-P.final.bam"
> gsub("bam$", "bai", bamPath) # Don't match bam in directories' names.
[1] "/scratch/hm82/OSCC/Final_bams/OSCC_1-P.final.bai"

Thus, the directory name is not accidentally changed by the pattern matching and substitution. Only BAM file name will be.

rbentham commented 2 years ago

Thanks, this has been fixed in the latest commit