BS-Seeker2 alignments produce "almost" empty bam files after masking

pablobio commented 1 year ago

Dear Adam,

Currently, I am trying to use Revelio to mask bam files obtained after the alignment of the WGBS read by Bsseaker2. I tried both ways to run Revelio (creating the proper MD tags and using the reference genome directly). However, in both ways, the generated bam files are substantially smaller than the original bam files (the log files generated for all the bam files show that the program successfully ran). Additionally, when I run the variant calling pipeline the vcf files generated are empty. I tried to run Revelio on "raw" bam files, sorted by position, sorted by position with duplicates removed using both possibilities (creating the proper MD tags and using the reference genome directly) and I always got the same results. Would you have any suggestions about how to fix this issue? Thank you very much.

bio15anu commented 1 year ago

Hi Pablo,

Sorry to hear you are having problems with Revelio. The way the program works is it splits the input bam file by scaffold and runs the masking in parallel depending on however many threads you have specified. When the masking is ready, there is a separate "merger" process which recursively combines completed scaffolds back together into one final bam file, based on the open file limit from your system. I suspect in your case there is an issue with this part of the process?

Are you able to investigate the temp directory (default: /var/tmp) and see what is going on with the individual bam files prior to merging? You can provide a custom temp location using the -t parameter. It would also be good to know how many threads you are running, and what the open file limit is set to on your system?

pablobio commented 1 year ago

Thank you very much for the quick reply, Adam.

I am running revelio in an HPC using 15 threads. I am going to try to run including the -t parameter using the temporary directory. As soon as I get a result I will inform you. Just as a matter of curiosity, I had an issue with the os.replace function. due to the architecture of the system, it was not able to use this function and I edited the program shutil.move to fix the issue. I checked in another system with a single bam file an "original" version of the program with os.replace and the result was the same obtained with the shutill.move. I am mentioning this just inc ase you didn't faced this issue before running revelio.

bio15anu commented 1 year ago

As far as I would understand it, it should work in exactly the same way. According to the python docs for shutil.move though, in many cases it simply uses os.rename under the hood. If you are having an issues with the os library are you sure those issues aren't persisting with the new command?

In any case, the command should only execute on the final file once all merging processes are completed and the thread pool is closed. Depending on what you did here exactly, there is a possibility that the wrong file (i.e. one of the intermediary files) is moved instead perhaps?

bio15anu commented 1 year ago

Does the final file contain only alignments from a single scaffold, or several?

pablobio commented 1 year ago

Thank you, Adam.

I don't think that this could be the problem as I tried to run a version of the script without changing anything in another server (where I don't have the issue with os.replace) and I get the masked bam file with exactly the same size. When I have an issue with os.replace, Revelio was not able to save the output file in the folder. Just in order to provide a complete description of the error, when I used the revelio with os.replace in the HPC the error informed was "source and destination are on different filesystems". I believe that the most probable reason for the error that I initially report could be in the combination of the temporary files, as you mentioned. Thank you!

bio15anu commented 1 year ago

Interesting, in which case, like you say, it seems an issue with your server architecture wherein the default temp directory where the intermediate files are located is somehow partially inaccessible to Revelio. Hopefully then, using the -t parameter and specifying a local directory on the same file system (i.e. in your home directory perhaps?) should solve the problem, without needing to rewrite os.replace to shutil.move?

Can you confirm if that indeed solves the issue?

pablobio commented 1 year ago

Hi Adam,

It didn't work to include the -t parameter. I am sending below the message that I got from revelio after the process to finish.

Total alignments read from BAM: 428125776

Success!

However, the masked bam file generated has only 3361 bytes. Additionally, after running samtools flagstat for the same file, all the statistics are equal to zero. The original bam file has around 20 Gb.

I still don't know what could be the cause of this issue. Must the reference file be indexed or pre-processed in some way?

I am sorry to keep bothering you.

bio15anu commented 1 year ago

Okay it sounds like there are a couple of things going on. I understood that it worked on "another server" but not on your main HPC, is this correct or did I misinterpret? At the end you should get a masked bam file which is exactly the same size as the original bam file in terms of the number of alignments.

For clarity:

the default value for the -t parameter is /var/tmp and you should now be setting it to point to a path which is on the same file system as where you run the program e.g. -t ~/my/new/temp/directory - this should avoid the error "source and destination are on different filesystems"
in terms of pre-processing, the original bam file should be sorted by coordinate and indexed using e.g. samtools index calmd.bam in order for the parallel threads to randomly access individual scaffolds for masking. I think this last part is missing from the readme so I will correct that!

pablobio commented 1 year ago

Hi Adam,

No, it didn't work on another server. I got the masked bam file with 3 Kb as well, as in the HPC. Actually, what worked in the other server was the use of os.replace. But in the end, the result was the same. The bam files are sorted by coordinate and indexed. The temporary folder was set to the same file system as well (in the HPC). I don't know if could be some issue with the reference file. However, it is exactly the same used during the alignment step with Bsseeker2.

bio15anu commented 1 year ago

Hi Pablo,

In that case, would you be able to provide a small subset of the data so that I could try and recreate the issue and solve it on this side? I think there is not enough information here to solve the problem you are experiencing.

bio15anu commented 1 year ago

In the meantime, you could also try running the EpiDiverse/SNP pipeline and see if the issue persists in that case?

pablobio commented 1 year ago

Hi Adam,

Yes, sure. I can share the files from one of the samples and the fasta file for the reference which I am using. How do you want me to share the files with you? In the meantime, I will run the EpiDiverse pipeline. Thank you very much.

bio15anu commented 1 year ago

Would WeTransfer be an option? I think you sent me an email around the time you opened this issue and you can use that same address

pablobio commented 1 year ago

Hi Adam,

I just shared with you a sample of the files by WeTransfer. In the meantime I was preparing the files, I tried to use the EpiDiverse/SNP pipeline. I ran the pipeline using an environment for python 3.9 and an environment for python 2.7. While I was running the pipeline using python 3.9 I got an error related to a print command in one of the scripts (fasta_generate_regions.py). To be more specific, the error was: SyntaxError: Missing parentheses in call to 'print'. Did you mean print("usage: ", sys.argv[0], " ")?. This error is caused he you try to run a script with the syntax from python 2 on python 3. Therefore, I changed to the environment created using python 2.7. However, while I was running the pipeline using Python 2.7 version I got an error related to the replace command "AttributeError: 'module' object has no attribute 'replace'" in the script change_sam_queries.py. Therefore, I was not able to finish the EpiDiverse/SNP pipeline. Thank you very much for your kindness in helping me with these issues. Cheers.

bio15anu commented 1 year ago

Hi Pablo,

I am downloading the files and will take a look asap.

With regard to the EpiDiverse/SNP pipeline, you do not need to worry about installing individual software dependencies manually. It will download everything and manage an environment automatically for you, so long as you have at least one of conda, docker, or singularity installed on your system. This completely circumvents the need for you yourself to manage different environments for python, for example.

pablobio commented 1 year ago

Thank you Adam,

I try to use the environment because while I was running the pipeline and letting it manage the software and dependencies I got this error: ".command.sh: line 2: fasta_generate_regions.py: command not found". Therefore, I tried to install freebayes manually.

Cheers.

bio15anu commented 1 year ago

May I ask which of conda, docker or singularity you would choose to use in your case?

If you would use conda, for example, then you would have to run the pipeline with the -profile conda option. Note it is only 1 dash instead of the 2 dashes which are necessary for other options (i.e. -profile and not --profile). The error above indicates that the correct profile was not initialised during the running of the pipeline.

pablobio commented 1 year ago

Hi Adam,

I am using conda. My command line to run the pipeline is: ./nextflow run EpiDiverse/SNP -profile conda --variants --input RFI_bam/ --reference /home/pablo/meth_milk_sheep/reference_RambV2/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna

pablobio commented 1 year ago

I will try to update conda. Just in case.

bio15anu commented 1 year ago

Then this looks like it could either be an issue with conda as you say, or it might be a problem with the version of nextflow.

You can try changing the command to NXF_VER=20.07.1 nextflow run EpiDiverse/SNP -profile conda --variants --input RFI_bam/ --reference /home/pablo/meth_milk_sheep/reference_RambV2/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna

There is really very little going on with -profile conda it is simply pointing to the conda environment in the location env/environment.yml which indeed contains freebayes=1.3.2 already as a dependency. All that happens is Nextflow then knows automatically to create a conda environment from this yml file and run the pipeline with it.

bio15anu commented 1 year ago

Hi again,

I have been looking through the data you sent me now and found an issue with the Dmilk10C_filter_RFI.bam file - there are no base qualities present! Field number 11 in the bam file is meant to contain a string of equal length to the sequence in field 10, with base quality scores represented as ASCII symbols, but in your file all strings have been replaced with "*".

If there are no base qualities available, there is no possible way to run the double-masking procedure. Do you know if anything could have happened during the preprocessing of this file to remove the base qualities?

bio15anu commented 1 year ago

Upon further investigation it seems that this is the standard behaviour of BS-Seeker2, to produce alignment files without base quality scores. Though technically this is not in breach of the official sam/bam specification, it should be noted that many downstream analyses such as variant calling are dependent on the availability of this information in order to work properly. It is a bit surprising to me that BS-Seeker2 neglects to provide it - I expect the reason is to save on disk space.

Before running revelio, I would advise either i) to pre-process the BAM files from BS-Seeker2 so that the appropriate base quality scores are extracted from the fastq files and appended correctly to the corresponding alignments, or ii) to use another alignment software altogether which provides the expected information. For this I would recommend bwa-meth.

pablobio commented 1 year ago

Thank you very much, Adam! I was reading about Bsseeker2 in order to figure out if the issue was during the alignment step. Do you have any suggestion of software or script that can update the base quality scores on bam files using the original fastq files?

Cheers.

bio15anu commented 1 year ago

I have added the utility script util/add_bq_from_fastq.py which should help you perform the task. I haven't tested it extensively, so please make sure to look through your results and double-check that they make sense. All sequence and base quality strings should be the same length, and in alignments to the reverse strand they should appear in reverse orientation relative to the fastq files.

The script requires the same pysam dependency as Revelio, and for the moment it takes only uncompressed fastq. You should run it like this: ./add_bq_from_fastq.py -1 read_1.fastq -2 read_2.fastq input.bam output.bam

pablobio commented 1 year ago

Thank you very much, Adam. I will try to use the script and check the results. As soon as I have some results I will provide you with some feedback.

cheers.

bio15anu / revelio

BS-Seeker2 alignments produce "almost" empty bam files after masking #3