Closed · pablobio closed this issue 1 year ago
Hi Pablo,
Sorry to hear you are having problems with Revelio. The program works by splitting the input bam file by scaffold and running the masking in parallel across however many threads you have specified. When the masking is done, a separate "merger" process recursively combines the completed scaffolds back together into one final bam file, in batches sized according to the open file limit on your system. I suspect that in your case there is an issue with this part of the process.
Are you able to investigate the temp directory (default: /var/tmp) and see what is going on with the individual bam files prior to merging? You can provide a custom temp location using the -t parameter. It would also be good to know how many threads you are running, and what the open file limit is set to on your system.
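The batched, recursive merge described above can be sketched in plain Python. This is only an illustrative sketch, not Revelio's actual merger (which operates on BAM files via pysam); the function name and the toy list-based "merge" are made up for the example:

```python
import resource


def merge_in_batches(parts, merge, limit=None):
    """Recursively merge completed parts in batches no larger than the
    open file limit, so no single merge step exceeds the file descriptor
    budget. Illustrative sketch only."""
    if limit is None:
        # Leave some headroom below the soft limit on open file descriptors.
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        limit = max(2, soft - 16)
    while len(parts) > 1:
        parts = [merge(parts[i:i + limit])
                 for i in range(0, len(parts), limit)]
    return parts[0]


# Toy "merge" that concatenates lists standing in for per-scaffold files:
result = merge_in_batches([[i] for i in range(10)],
                          merge=lambda chunk: sum(chunk, []),
                          limit=3)
print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With limit=3 the ten single-element parts collapse in two rounds (10 → 4 → 2 → 1), which is the same shape of recursion the merger performs when the scaffold count exceeds the open file limit.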
Thank you very much for the quick reply, Adam.
I am running Revelio on an HPC using 15 threads. I am going to try running it with the -t parameter to set the temporary directory. As soon as I get a result I will let you know. Just as a matter of curiosity, I had an issue with the os.replace function: due to the architecture of the system, it was not able to use this function, so I edited the program to use shutil.move instead. I checked an "original" version of the program with os.replace on another system with a single bam file, and the result was the same as with shutil.move. I am mentioning this just in case you haven't faced this issue before when running Revelio.
As far as I understand it, it should work in exactly the same way. According to the python docs for shutil.move, though, in many cases it simply uses os.rename under the hood. If you are having issues with the os library, are you sure those issues aren't persisting with the new command?
In any case, the command should only execute on the final file once all merging processes are completed and the thread pool is closed. Depending on what exactly you changed, is there a possibility that the wrong file (i.e. one of the intermediary files) is being moved instead?
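The difference between the two calls can be demonstrated with a small stdlib-only sketch. The file names here are hypothetical; the key point is that shutil.move falls back to copy-then-delete when a plain rename fails, whereas os.replace requires source and destination to be on the same file system and otherwise raises OSError ("Invalid cross-device link"):

```python
import os
import shutil
import tempfile

# Two separate temp directories; on most systems these are on the same
# file system, so both os.replace and shutil.move would work here. The
# difference only shows up when they sit on different mounts.
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "masked.tmp.bam")
dst = os.path.join(dst_dir, "masked.bam")

with open(src, "w") as fh:
    fh.write("placeholder")

# shutil.move tries os.rename first, then copies and deletes on failure,
# so it succeeds even across file systems.
shutil.move(src, dst)
print(os.path.exists(dst))  # True
print(os.path.exists(src))  # False
```

This is why swapping os.replace for shutil.move "fixed" the cross-filesystem error: the semantics on the final file are equivalent, only the fallback behaviour differs.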
Does the final file contain only alignments from a single scaffold, or several?
Thank you, Adam.
I don't think that this could be the problem, as I ran an unchanged version of the script on another server (where I don't have the issue with os.replace) and I got a masked bam file of exactly the same size. When I had the issue with os.replace, Revelio was not able to save the output file in the folder at all. Just to provide a complete description of the error: when I used Revelio with os.replace on the HPC, the error reported was "source and destination are on different filesystems". I believe that the most probable reason for the error I initially reported is in the combination of the temporary files, as you mentioned. Thank you!
Interesting. In which case, like you say, it seems to be an issue with your server architecture, wherein the default temp directory holding the intermediate files is somehow partially inaccessible to Revelio. Hopefully then, using the -t parameter and specifying a local directory on the same file system (i.e. in your home directory, perhaps?) should solve the problem, without needing to rewrite os.replace to shutil.move.
Can you confirm whether that indeed solves the issue?
Hi Adam,
Including the -t parameter didn't work. I am pasting below the message that I got from Revelio after the process finished.
Total alignments read from BAM: 428125776
Success!
However, the masked bam file generated has only 3361 bytes. Additionally, after running samtools flagstat on the same file, all the statistics are equal to zero. The original bam file is around 20 GB.
I still don't know what could be the cause of this issue. Must the reference file be indexed or pre-processed in some way?
I am sorry to keep bothering you.
Okay it sounds like there are a couple of things going on. I understood that it worked on "another server" but not on your main HPC, is this correct or did I misinterpret? At the end you should get a masked bam file which is exactly the same size as the original bam file in terms of the number of alignments.
For clarity:
- The default for the -t parameter is /var/tmp, and you should now be setting it to point to a path which is on the same file system as where you run the program, e.g. -t ~/my/new/temp/directory. This should avoid the error "source and destination are on different filesystems".
- The bam file must be indexed, e.g. samtools index calmd.bam, in order for the parallel threads to randomly access individual scaffolds for masking. I think this last part is missing from the readme so I will correct that!

Hi Adam,
No, it didn't work on the other server either. I got a masked bam file of only 3 KB there as well, just as on the HPC. What actually worked on the other server was the use of os.replace, but in the end the result was the same. The bam files are sorted by coordinate and indexed, and the temporary folder was set to the same file system as well (on the HPC). I don't know if there could be some issue with the reference file; however, it is exactly the same file used during the alignment step with BS-Seeker2.
Hi Pablo,
In that case, would you be able to provide a small subset of the data so that I could try and recreate the issue and solve it on this side? I think there is not enough information here to solve the problem you are experiencing.
In the meantime, you could also try running the EpiDiverse/SNP pipeline and see if the issue persists in that case?
Hi Adam,
Yes, sure. I can share the files from one of the samples and the fasta file for the reference which I am using. How do you want me to share the files with you? In the meantime, I will run the EpiDiverse pipeline. Thank you very much.
Would WeTransfer be an option? I think you sent me an email around the time you opened this issue, so you can use that same address.
Hi Adam,
I just shared a sample of the files with you via WeTransfer. While I was preparing the files, I tried to use the EpiDiverse/SNP pipeline. I ran the pipeline using an environment for python 3.9 and an environment for python 2.7. While running the pipeline with python 3.9 I got an error related to a print statement in one of the scripts (fasta_generate_regions.py). To be more specific, the error was: SyntaxError: Missing parentheses in call to 'print'. Did you mean print("usage: ", sys.argv[0], "
Hi Pablo,
I am downloading the files and will take a look asap.
With regard to the EpiDiverse/SNP pipeline, you do not need to worry about installing individual software dependencies manually. It will download everything and manage an environment automatically for you, so long as you have at least one of conda, docker, or singularity installed on your system. This completely circumvents the need for you yourself to manage different environments for python, for example.
Thank you Adam,
I tried to use the environment because when I ran the pipeline and let it manage the software and dependencies, I got this error: ".command.sh: line 2: fasta_generate_regions.py: command not found". Therefore, I tried to install freebayes manually.
Cheers.
May I ask which of conda, docker or singularity you would choose to use in your case?
If you would use conda, for example, then you would have to run the pipeline with the -profile conda option. Note it is only 1 dash instead of the 2 dashes which are necessary for other options (i.e. -profile and not --profile). The error above indicates that the correct profile was not initialised during the running of the pipeline.
Hi Adam,
I am using conda. My command line to run the pipeline is: ./nextflow run EpiDiverse/SNP -profile conda --variants --input RFI_bam/ --reference /home/pablo/meth_milk_sheep/reference_RambV2/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna
I will try to update conda, just in case.
Then this looks like it could either be an issue with conda as you say, or it might be a problem with the version of nextflow.
You can try changing the command to NXF_VER=20.07.1 nextflow run EpiDiverse/SNP -profile conda --variants --input RFI_bam/ --reference /home/pablo/meth_milk_sheep/reference_RambV2/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna
There is really very little going on with -profile conda: it is simply pointing to the conda environment defined in env/environment.yml, which indeed already contains freebayes=1.3.2 as a dependency. All that happens is Nextflow then knows automatically to create a conda environment from this yml file and run the pipeline with it.
Hi again,
I have been looking through the data you sent me and found an issue with the Dmilk10C_filter_RFI.bam file: there are no base qualities present! Field 11 in the bam file is meant to contain a string of equal length to the sequence in field 10, with base quality scores represented as ASCII symbols, but in your file all of these strings have been replaced with "*".
If there are no base qualities available, there is no possible way to run the double-masking procedure. Do you know if anything could have happened during the preprocessing of this file to remove the base qualities?
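The condition described above is easy to check on the text (SAM) representation of an alignment. This is a minimal sketch using the field positions defined by the SAM specification; the example record and the function name are made up:

```python
def has_base_qualities(sam_line: str) -> bool:
    """Return True if an alignment record carries real base qualities.

    In a SAM record, column 10 (index 9) is SEQ, the read sequence, and
    column 11 (index 10) is QUAL, the ASCII-encoded base qualities.
    A QUAL value of "*" means the qualities are absent.
    """
    fields = sam_line.rstrip("\n").split("\t")
    seq, qual = fields[9], fields[10]
    return qual != "*" and len(qual) == len(seq)


# A made-up record in the style BS-Seeker2 produces (QUAL replaced by "*"):
no_qual = "read1\t0\tchr1\t100\t40\t4M\t*\t0\t0\tACGT\t*"
# The same record with proper base qualities attached:
with_qual = "read1\t0\tchr1\t100\t40\t4M\t*\t0\t0\tACGT\tIIFF"

print(has_base_qualities(no_qual))    # False
print(has_base_qualities(with_qual))  # True
```

In practice you could pipe `samtools view file.bam` through a check like this to count how many alignments are missing their qualities.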
Upon further investigation it seems that this is the standard behaviour of BS-Seeker2, to produce alignment files without base quality scores. Though technically this is not in breach of the official sam/bam specification, it should be noted that many downstream analyses such as variant calling are dependent on the availability of this information in order to work properly. It is a bit surprising to me that BS-Seeker2 neglects to provide it - I expect the reason is to save on disk space.
Before running revelio, I would advise either i) to pre-process the BAM files from BS-Seeker2 so that the appropriate base quality scores are extracted from the fastq files and appended correctly to the corresponding alignments, or ii) to use another alignment software altogether which provides the expected information. For this I would recommend bwa-meth.
Thank you very much, Adam! I was reading about BS-Seeker2 in order to figure out whether the issue arose during the alignment step. Do you have any suggestions for software or a script that can restore the base quality scores in bam files using the original fastq files?
Cheers.
I have added the utility script util/add_bq_from_fastq.py which should help you perform the task. I haven't tested it extensively, so please make sure to look through your results and double-check that they make sense. All sequence and base quality strings should be the same length, and in alignments to the reverse strand they should appear in reverse orientation relative to the fastq files.
The script requires the same pysam dependency as Revelio, and for the moment it takes only uncompressed fastq. You should run it like this:
./add_bq_from_fastq.py -1 read_1.fastq -2 read_2.fastq input.bam output.bam
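The orientation rule mentioned above (reverse-strand alignments store SEQ and QUAL reversed relative to the fastq) can be sketched in pure Python. This is an illustration of the check, not the actual utility code; the function name is hypothetical:

```python
def qual_for_alignment(flag: int, fastq_qual: str) -> str:
    """Return the quality string oriented to match the alignment.

    Bit 0x10 of the SAM FLAG marks a reverse-strand alignment. Its SEQ
    and QUAL fields are stored reverse(-complemented) relative to the
    original fastq record, so the fastq quality string must be reversed
    before being attached to such an alignment.
    """
    REVERSE_STRAND = 0x10
    if flag & REVERSE_STRAND:
        return fastq_qual[::-1]
    return fastq_qual


print(qual_for_alignment(0, "IIFF#"))   # forward strand: IIFF#
print(qual_for_alignment(16, "IIFF#"))  # reverse strand: #FFII
```

This is also what to eyeball when double-checking the script's output: forward-strand alignments should show qualities identical to the fastq, reverse-strand alignments should show them reversed, and every QUAL string should match the length of its SEQ.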
Thank you very much, Adam. I will try to use the script and check the results. As soon as I have some results I will provide you with some feedback.
Cheers.
Dear Adam,
Currently, I am trying to use Revelio to mask bam files obtained after aligning WGBS reads with BS-Seeker2. I tried both ways of running Revelio (creating the proper MD tags, and using the reference genome directly). However, in both cases the generated bam files are substantially smaller than the original bam files (the log files generated for all of the bam files show that the program ran successfully). Additionally, when I run the variant calling pipeline, the vcf files generated are empty. I tried running Revelio on "raw" bam files, on files sorted by position, and on files sorted by position with duplicates removed, using both possibilities (creating the proper MD tags and using the reference genome directly), and I always got the same results. Would you have any suggestions about how to fix this issue? Thank you very much.