cbrackley / capC-MAP

A Software Package for Analysis of Capture-C data
GNU General Public License v3.0
3 stars 1 forks source link

Error: "cannot open file srt_aligned.sam." while running capCmain. #4

Closed AgustinArce closed 3 years ago

AgustinArce commented 3 years ago

Dear Chris,

I am having trouble running capCmain, I am receiving the message:

ERROR in main processing stage : cannot open file srt_aligned.sam.

I have tried installing capC-MAP with conda and compiling it from source, in a PC and in a cluster, and I always get the same error. I was initially running "capC-MAP run" and when I traced back the error I found the problem was the capCmain command (written to capC-MAP.commands.log). Now I just run capCmain on the files produced by the "capC-MAP run" pipeline and get the same message.

I have also tried providing relative and absolute paths to srt_aligned.sam, always using TAB-completion, with no success. In fact, fragments and targets bed files are properly read with absolute or relative paths, so this does not seem to be the problem.

Looking into the source code (I have no C++ programming experience), the piece of code reading the sam file seems to be in main_process.cc, line 115; which I believe uses code from parse_sam.cc, line 99 (error in line 150).

I would appreciate any help on how to solve this problem.

Best, Agustín

PC: OS Ubuntu 20.04, Python 3.6.5 (with biopython), bowtie 1.3.0, samtools 1.9-52 Cluster: OS CentOS 6.10, Python 3.6.1 (with biopython, I have also tried with ver 2.7), bowtie 1.3.0, samtools 1.8

cbrackley commented 3 years ago

Dear Agustín, I'm not sure what could be causing this problem. I think you are correct about which line in the code is throwing the error. But this should only give an error if the file isn't able to be opened. If you look at the contents of the srt_aligned.sam file does it look as expected, with the correct sam format? Presumably there is no problem with file permissions or a a full disk? The only other thing I can think of is that there is some issue with the samtools version: I think we were using version 1.3.1 when we wrote the software. It might be that something has changed, and the file contents are not what is expected. Though I would expect a different error in that case. Were there any messages in the log file from samtools? Best, Chris

AgustinArce commented 3 years ago

Dear Chirs,

Thank you very much for your answer.

The sam file seems correct, although I agree the error points to the file not being opened and not to a format problem. Despite this, the sam file has a header with the usual @HD, @SQ and @PG lines. In fact, this file is produced by a previous command in the "capC-MAP run" pipeline which converts the srt_alinged.bam file into this sam file with samtools, and I believe this fails with a bad bam file (at least this happens with a headerless bam file). However, as this was done with a newer samtools version, I tried regenerating the file with samtools 1.3.1 (in another conda environment) and then got the same "cannot open file" error (the file had the same header) Just in case, I also tried manually excluding the @HD, the @PG or both lines with the same result. I tried this since the SAMv1 format specification indicates that the @HD line is optional and the @PG was not present in older SAM files.

There is plenty of disk space, particularly in the cluster, so I do not believe this is a problem with filling the disk. Regarding file permissions, the srt_aligned.sam file has the same permissions as the rest of the files in the output folder, with read permissions for the owner, the group and all users; and write permissions for the owner and the group ("-rw-rw-r--", as would be set with "chmod 664").

I checked the log files and there are no other errors, only some warnings indicating that digested reads shorter than 4 characters are being skipped.

Please let me know if you think there are other things I can try or check. And again, thank you very much for your help.

Best, Agustín

cbrackley commented 3 years ago

Dear Agustín, I'm not sure what can be going wrong. It should treat any line which begins with '@' as a header line and ignore it. What happens if you try the capCmain command, but instead of the srt_aligned.sam you use an empty file. It should then throw a different error message; that should confirm at which point in the code the error is being thrown. Could you also send me the command line you are using for capCmain (from the log file).

AgustinArce commented 3 years ago

Dear Chris,

I created an empty file with "touch" and received the same "cannot open file" error. I tried a different file and also a non-existent file and the error is the same. I fear it is not actually getting the right path to the file, even when I use the absolute path. Could it be appending something to the path?.

I get the same error if I use a wrong path to the bed files, targets or restriction fragments, but then when they are correct I get the messages starting with "...Loaded restriction enzymes from file" or "...Loaded list of 3142 targets".

I have attached the capC-MAP.commands.log file. This is the output of "capC-MAP run". Please note that the last command has a relative path to srt_aligned.sam, although the "capC-MAP run" is not being run from the output folder, which is capC-MAP/test, it is being run inside a snakemake pipeline in the "root" folder of my analysis. I have since tried all different commands I commented on by directly running the last capCmain command with multiple alternative paths (absolute, relative, wrong sam path, wrong bed path, empty sam, etc) from the console, not with snakemake.

If it is of any help, in between runs I have to remove the captured_validpairsFragment*.pairs (empty) files for a new run to execute. They are being created although the sam file is not read. The number of files is 1021, not the number of targets I input (3142. Maybe a wrong assumption since I have not read the details about this output).

Best, Agustín

capC-MAP.commands.log

cbrackley commented 3 years ago

Hi Agustín, Since the same error crops up even when the file is empty, I think maybe it is something else which is causing the problem. Looking at the code (its been a while since I wrote it), I think is could be a problem when the "validpairs" files are created. Indeed there should be one per target. When the sam file is opened it uses the same io object as was used to check if the validpairs file already exist, so it could be throwing an error because the previous file was not closed properly. One thing to try, do you get the same error if you remove the '-i' option from the capCmain command. Second thing to check: how you have named the targets. Since these are used in the file names, if the target names e.g. have characters which cannot be used in a file name, ; : / & etc., this could cause a problem. Also if there is a duplicated target name (though probably it checks for that). Best, Chris

AgustinArce commented 3 years ago

Dear Chris,

I do get the same error when I remove the "-i" option from the command.

Regarding the fragment names, I was using an underscore "_" in the name, this is, names were "Fragment_1" to "Fragment_3142". I removed the underscore and then changed "Fragment" to "F" (and even only numbers), aiming to reduce name length, with no success. I also checked and fragment names are not duplicated.

What is odd is that I checked validpairs files produced and there is a clear pattern. It is only creating files of fragments which contain a number that starts with 1. To be clear, when I list them, remove common parts of the name and sort them with sort -h (human-numeric-sort), I see it produces a file for fragments: 1, 10, 11-19,100-191,1000-1917. Therefore, out of the 3142 fragments it is producing only 1021 files, for some reason stopping in 1917 (although by this "pattern" it should produce numbers up to 1999).

As you proposed, some problem with how these files are produced might be causing the error with the sam file. Is there anything I can check based on this?

Best, Agustín

cbrackley commented 3 years ago

Hi Agustín, This could be a problem with a limit on the number of open files which is allowed. I think this depends on the operating system, and might be set at 1024 by default on linux systems. I think you can increase it with the 'ulimit' command, but you might need sys admin rights. The way we coded up capCmap was to open one file for each target for 'intra chomosomal' reads, and if the '-i' switch is used, it also opens a second file for each target for interchromosomal reads. I don't think we ever tested a case with 1000s of targets, only 100s. As a test, I would try running the capCmain command but use a smaller list of targets, e.g. a few 100. If this starts to run with no problems, then its most likely down to the number of open files. The solution would then be to increase the limit (I think its the 'ulimit -n' command in bash, not sure about other systems). A work around would be to split the targets into groups, and run the analysis multiple times. You don't need to run the whole thing again, just from the capCmain command onward. The drawback of this would be that the statistics generated for counting things like 'number of reads which don't map to a target' would not be correct. Best wishes, Chris

AgustinArce commented 3 years ago

Hi Chris, That was the problem!! I checked it with ulimit -Sn and it was indeed 1024. I ran it with 100 targets and it worked. It produced the validpairs files, the validinterchom files, the captured_report.dat and captured_interactioncounts.dat files. I might solve it by splitting the fragments in groups, as you suggest, but will be careful with the resulting counting statistics. Thank you very much for your help and prompt responses! Best wishes, Agustín