daveuu / baga

Bacterial and Archaeal Genome Analyser
GNU General Public License v3.0
9 stars 2 forks source link

failure of BAGA filter rearrangements #14

Open stefangoe opened 7 years ago

stefangoe commented 7 years ago

Dear BAGA developers,

I would be happy to get support with the following problem concerning rearrangement analysis within the BAGA pipeline, installed on December 2nd from git clone http://github.com/daveuu/baga.git (BAGA version v0.2.1, Python version 2.7.5)

Executing the pipeline described in https://github.com/daveuu/baga/blob/master/docs/guide1.md works well until step 3 (alignment of reads). However, in step 4 (reaarangements), dependency check baga/baga_cli.py Dependencies --checkgetfor Structure is still OK but check for rearrangements by baga/baga_cli.py Structure --reads_name myreads --genome_name mygenome --ratio_threshold 0.4 --check ends without error message and the following prompt:

baga.CollectData.Genome-mygenome.baga Loading alignments information for: taudien_myreads__mygenome from AlignReads output indexing alignments/mygenome/mysample___mygenome_dd_si_realn.bam Using pySAM

The files

si_realn.bam.bai mean_insert_size.baga *realn_depths.baga

were created but when starting the next step by baga/baga_cli.py Structure --reads_name myreads --genome_name mygenome --plot the error message is:

baga.CollectData.Genome-mygenome.baga Loading alignments information for: myreads__mygenome from AlignReads output Loading genome mygenome Could not find: baga.Structure.CheckerInfo- myreads___ mygenome.baga Traceback (most recent call last): File "baga/baga_cli.py", line 1979, in checker_info['genome_name'], NameError: name 'checker_info' is not defined

When repeating the check, the script obviously starts at a later point, since the depth- and insert size files are already available, but finally ends with an error message:

Collecting coverage for myreads_ reads aligned to mygenome Found and loaded previously scanned coverage depths from alignments/mygenome/Myreadsmygenome_dd_si_realndepths.baga which saves time! Calculating mean insert size for myreads reads aligned to mygenome Loaded insert size from alignments/mygenome/mysample__mygenome_dd_si_realn_mean_insertsize.baga Previously, no paired, aligned reads found in alignments/mygenome/mysamplemygenome_dd_si_realn.bam Structure analysis for rearrangements is not possible Skipping: no aligned pairs found

We suspected the problem somewhere around line 135 in Structure.py and shifted up _pysam.index(BAM) (before try):

instantiate Structure Checkers # instantiate Structure Checkers

checkers = {}
for BAM in BAMs:
    indexfile = _os.path.extsep.join([BAM,'bai'])
print ("Test: ", indexfile)
    if not(_os.path.exists(indexfile) and _os.path.getsize(indexfile) > 0):
        print('indexing {}'.format(BAM))
        fail = False
        _pysam.index(BAM)
    try:
            print('Using pySAM')
    _pysam.index(BAM)

Then the error message is:

Loading alignments information for: myreads__mygenome from AlignReads output ('Test: ', u'alignments/mygenome/mysample_mygenome_dd_sirealn.bam.bai') indexing alignments/mygenome/mysamplemygenome_dd_si_realn.bam Traceback (most recent call last): File "baga/baga_cli.py", line 1937, in ratio_threshold = args.ratio_threshold) File "mypath/baga_analysis/baga/Structure.py", line 143, in checkStructure _pysam.index(BAM) File “mypath/baga_analysis/local_packages/pysam/init.py", line 66, in call self.dispatch, args, catch_stdout=kwargs.get("catch_stdout", True)) File "pysam/csamtools.pyx", line 132, in pysam.csamtools._samtools_dispatch (pysam/csamtools.c:3259) File "pysam/csamtools.pyx", line 34, in pysam.csamtools._forceCmdlineBytes (pysam/csamtools.c:1380) File "pysam/csamtools.pyx", line 22, in pysam.csamtools._forceBytes (pysam/csamtools.c:1237) TypeError: Expected bytes, got unicode

Do you have any idea what’s going wrong? Thank you very much for your help!

daveuu commented 7 years ago

Hi Stefan,

The root of this problem has to do with text encoding and how Python handles it - version 2 and 3 are slightly different which can lead to some issues. This issue is familiar - I think it has been encountered and fixed already, but in a different development branch. So, in this instance you might be better off running that version of BAGA available from within the git repository.

I'm not sure what you know about git but it provides "branches": different versions existing in parallel in the same git repository which can be swapped between.

To use the new version I would recommend cloning baga from github.com/daveuu/baga again into a separate folder to keep things more manageable. Then 'cd' into the new repository (cd baga) and type:

git checkout dev1

This will transform the visible content to the newer version.

git status

Should then say: "On branch dev1"

You should then be able to run your commands again (but there may be small differences but you should find them as they arise i.e., fail. The commands will not change before the next proper version (which could be referenced in any publications you might do).

You'll need to start the analysis again, including downloading the reference genome because it will be saved in a different way with all replicons, each with their own accession number. The whole genome will have an Assembly Accession number instead from the NCBI Genome Assembly database e.g.,

http://www.ncbi.nlm.nih.gov/assembly/GCF_001457515.1

where GCF_001457515.1 contains NZ_LN831048.1, NZ_LN831049.1, and NZ_LN831050.1

Hopefully by using the newer version, you will not encounter the same problem. If you do encounter any problems, let me know.

stefangoe commented 7 years ago

Dear David,

thanks a lot for your fast and helpful response! As recommended I have installed again BAGA from github and everything seems to be fine with regard to transformation to the newer version:

git checkout dev1 Branch dev1 set up to track remote branch dev1 from origin. Switched to a new branch 'dev1'

git status

On branch dev1

nothing to commit, working directory clean

However, in the adaptor clipping and trimming step, new problems have arisen:

baga/baga_cli.py --nosplash PrepareReads --reads_name myreads --adaptors --trim --max_cpus 4 ERROR: Could not find the sickle executable at executable at /home/uni08/taudien/stau/sequence_mining/baga_analysis2/external_programs/sickle/sickle.

baga/baga_cli.py Dependencies --check sickle Checking for sickle: external_programs/sickle/sickle: [Errno 2] No such file or directory sickle: not found . . .

baga/baga_cli.py Dependencies --get sickle /cm/shared/apps/intel/compilers_and_libraries/2016.3.210/compiler/include/limits.h:1:1: warning: C++ style comments are not allowed in ISO C90 [enabled by default] // ^ /cm/shared/apps/intel/compilers_and_libraries/2016.3.210/compiler/include/limits.h:1:1: warning: (this will be reported only once per input file) [enabled by default] /cm/shared/apps/intel/compilers_and_libraries/2016.3.210/compiler/include/limits.h:36:54: error: missing binary operator before token "(" defined(has_include_next) && has_include_next() ^ make: *** [sliding.o] Error 1

Thanks again in advance for a recommendation...

Stefan

daveuu commented 7 years ago

Hi Stefan,

The command that fails downloads the source code for the program 'Sickle' and compiles it ready for use. The errors you are seeing are, I think, caused by a version mismatch between the compiler and the source code. I am surprised that switching BAGA versions has caused this problem because I think both versions are performing the same commands as far as the system and dependency installation is concerned. Are you using the same computer for both BAGA versions?

What is the output of this command:

gcc --version

gcc should be the compiler that is called when BAGA issues the command make. It seems your system defaults to using an Intel compiler which is complaining about sickle being non-ISO C90 compatible . . .

PS you can write using markdown so it is easier to read the code blocks. See here: https://guides.github.com/features/mastering-markdown/

stefangoe commented 7 years ago

Dear David,

thanks again for your help. To answer your question - yes, I used the same computer for both BAGA versions. But still some things are strange: 1) A simple "make" seemed to be enough to overcome the problems with sickle now. 2) The gcc version is 4.8.5 20150623 (Red Hat 4.8.5-4) 3) Everything runs fine now until alignment and deduplicating. The baga_AlignReads*.baga file has been created but the program was unable to open the _dd_si.bam files. Since these files are missing, in the next step (realignment around indels by GATK) fails due to assertion error.

Could you please help me once again ? Thanks a lot!

Stefan PS: I did not try markdown yet but will do it soon - sorry...

stefangoe commented 7 years ago

Dear David, just an additional comment: The dd_si.bam files were correctly created with the former BAGA version. Stefan