dputhier / pygtftk

A python package and a set of shell commands to handle GTF files
GNU General Public License v3.0

Segfault when using bed3 files #47

Closed qferre closed 5 years ago

qferre commented 5 years ago

Using peak_anno with a bed3 file triggers a segfault. No other messages are displayed, even with high verbosity.

Command example :

gtftk peak_anno -i hg38.gtf.gz -c hg38_ensembl.genome -p bed3_h3k4me3.bed -V 3

The command works fine when the exact same file is converted to a bed6 with filler characters (e.g. 'chr1 100 200' becomes 'chr1 100 200 A B C').
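For anyone who needs the workaround in the meantime, here is a minimal sketch of the padding step in plain Python. The function name `bed3_to_bed6` and the filler values are only illustrative (they are not part of pygtftk), and it assumes whitespace-separated columns as in the example above.

```python
def bed3_to_bed6(line, filler=("A", "B", "C")):
    """Pad one BED line to six tab-separated columns with filler values."""
    # split() accepts both tabs and spaces, as in the 'chr1 100 200' example.
    fields = line.rstrip("\n").split()
    if len(fields) < 3:
        raise ValueError("expected at least 3 columns, got: %r" % line)
    # Keep the columns that are already present and fill only the missing ones.
    fields += list(filler[len(fields) - 3:])
    # Drop anything beyond the sixth column and restore the newline.
    return "\t".join(fields[:6]) + "\n"


def pad_bed_file(src_path, dst_path):
    """Write a six-column copy of src_path to dst_path, skipping blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(bed3_to_bed6(line))
```

Something like `pad_bed_file("bed3_h3k4me3.bed", "bed6_h3k4me3.bed")` would then produce a file that can be passed to `-p`.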

I suspect this is related to the argument formatter and could appear in other parts of the pygtftk project.

qferre commented 5 years ago

Opening the file in r+ mode instead of r fixes the problem, but is potentially error-prone: there is no reason those files should be edited.

arg_formatter.FormattedFile(mode='r+', file_ext='bed')

This bypasses the part of arg_formatter that calls make_tmp_file (only called when mode == 'r'). This could be the reason.

dputhier commented 5 years ago

Clearly we should not open it in 'r+'.

dputhier commented 5 years ago

Are you able to read this file simply using the following?

file_bo = BedTool(string)

qferre commented 5 years ago

After investigation, it seems to fail precisely at line 482 in arg_formatter.py. Every line before works, every line after does not (as ascertained by painstakingly adding print('Everything up to here works') to test :)

I believe it is because it is trying to set the name field, which does not exist because pybedtools reads the file as a bed3. I'll keep investigating.

dputhier commented 5 years ago

Yes.... The code here is buggy.

                for record in file_bo:
                    if field_count < 4:
                        record.name = 'region_' + str(region_nb)

                    fields = record.fields[0:3]
                    fields += [record.name,
                               record.score,
                               record.strand]
                    tmp_file.write("\t".join(fields))

Should be replaced by something like:

                for record in file_bo:
                    if field_count < 4:
                        name = 'region_' + str(region_nb)

                    fields = record.fields[0:3]
                    fields += [name,
                               '0',
                               '.']
                    tmp_file.write("\t".join(fields))

The question is also what will happen with unstranded features...

qferre commented 5 years ago

pybedtools is known to throw segfaults when iterating over BedFile objects in certain conditions: https://github.com/daler/pybedtools/issues/82

I had already tried to fix it by doing pretty much the same modification as the one you posted in the comment above (also adding record.strand and record.score, to no effect), but I was looking for a cleaner solution.

dputhier commented 5 years ago

Think about doing this modification in the develop branch.

qferre commented 5 years ago

Using a fix practically identical to yours, there is no segfault, but there is now a different error: "gtftk peak_anno: error: argument -p/--peak-file: invalid FormattedFile('r') value: 'bed3_h3k4me3.bed'"
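For context on that message: argparse produces this generic "invalid ... value" text whenever the callable passed as `type=` raises a ValueError or TypeError, and it names the callable by its `__name__` or, failing that, its `repr`. The sketch below reproduces the shape of the message with a hypothetical stand-in class (`FormattedFileSketch` is not the real arg_formatter code, and the exception it raises is made up for illustration).

```python
import argparse


class FormattedFileSketch:
    """Hypothetical stand-in for an argparse type callable."""

    def __init__(self, mode="r"):
        self.mode = mode

    def __repr__(self):
        # argparse falls back to repr() because instances have no __name__.
        return "FormattedFile(%r)" % self.mode

    def __call__(self, path):
        # Any ValueError or TypeError raised here is caught by argparse,
        # so the original message below never reaches the user.
        raise ValueError("malformed record in %s" % path)


class RaisingParser(argparse.ArgumentParser):
    def error(self, message):
        # Raise instead of printing usage and exiting, so the text can be inspected.
        raise RuntimeError(message)


def parse_error_message(argv):
    """Return the argparse error text for argv, or None if parsing succeeds."""
    parser = RaisingParser(prog="gtftk peak_anno")
    parser.add_argument("-p", "--peak-file", type=FormattedFileSketch("r"))
    try:
        parser.parse_args(argv)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(parse_error_message(["-p", "bed3_h3k4me3.bed"]))
```

So the message by itself says little about the cause: the real exception is raised somewhere inside the type callable and has to be tracked down there.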

dputhier commented 5 years ago

Did you try:

type=arg_formatter.FormattedFile(mode='r', file_ext='bed')

qferre commented 5 years ago

This is already what is in the argument parser of peak_anno.

dputhier commented 5 years ago

And it's still working with a bed6 ???

qferre commented 5 years ago

Yep.

dputhier commented 5 years ago

Could you try to change the file name to toto.bed just in case there would be something weird with the regexp...

qferre commented 5 years ago

Still the same error.

dputhier commented 5 years ago

There is also something that seems to be related to pybedtools version. In my hands, I am able, using a bed3 file, to write something like:

    import pybedtools
    pybedtools.__version__ # '0.8.0'

    from pybedtools import BedTool
    a = BedTool("test.bed")
    for i in a:
        pass
    i.name # '.'
    i.name = 'bla'

dputhier commented 5 years ago

In this version, the name/score/strand attributes are set by default to '.'. So we should leave the code unchanged and ensure during installation that the pybedtools version is at least '0.8.0'.
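A rough sketch of such a check, using only the standard library. The helper names are hypothetical, and the parsing is deliberately simple: it compares the leading numeric parts only and ignores suffixes such as 'rc1'.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.8.0' into (0, 8, 0), keeping only the leading digits of each part."""
    parts = []
    for chunk in version.split("."):
        match = re.match(r"\d+", chunk)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="0.8.0"):
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.8' and '0.8.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_pybedtools_version():
    """Return the installed pybedtools version string, or None if absent."""
    try:
        return metadata.version("pybedtools")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pybedtools_version()
    if found is None or not meets_minimum(found):
        raise SystemExit("pybedtools >= 0.8.0 is required, found: %s" % found)
```

Declaring the minimum version in the package requirements would achieve the same thing at installation time; the runtime check is just a safety net.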

qferre commented 5 years ago

That will probably be easier and save us a lot of headaches :) I'll revert all modifications to arg_formatter on my side (I had not committed them anyway).

dputhier commented 5 years ago

What is your version?

qferre commented 5 years ago

Hmm... the version in my virtual environment for pygtftk was the 0.8.0 too...

dputhier commented 5 years ago

Ask for pybedtools.__file__ to check which installation is actually being imported...
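If importing the module is itself a problem, the location can also be looked up without importing it. A small sketch with the standard library (the helper name `module_origin` is made up):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The path shows which environment the module comes from.
    print(module_origin("pybedtools"))
```

If the printed path is outside the virtual environment, the version being imported is not the one that was checked.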

qferre commented 5 years ago

Yeah that's what I did.

qferre commented 5 years ago

Fixed it. The problem was simply a missing newline character at the end of each line written to the temp file while it is being generated.

I have also implemented the fix for the buggy code, discussed above.

The fix is on the peak_anno_shuffling branch, should I upload it to the develop branch as well ?

qferre commented 5 years ago

Clarification : the new line character fixed the second problem (not the segfault).

The segfault was fixed by not trying to write to a non-existent record.name, as pybedtools can have problems with certain operations when they are done inside iterators.
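To make both points concrete, here is a simplified sketch of the idea, not the actual code on the branch. Plain lists of strings stand in for the pybedtools records, and `write_bed6` is a hypothetical name.

```python
import io


def write_bed6(records, out):
    """Write records (lists of string fields) as six-column BED lines."""
    for region_nb, fields in enumerate(records, start=1):
        # Build the extra columns as local values instead of assigning to
        # attributes of the record while iterating over it.
        name = fields[3] if len(fields) > 3 else "region_" + str(region_nb)
        score = fields[4] if len(fields) > 4 else "0"
        strand = fields[5] if len(fields) > 5 else "."
        # The trailing newline matters: without it, every record ends up
        # concatenated on a single line of the temp file.
        out.write("\t".join(list(fields[:3]) + [name, score, strand]) + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_bed6([["chr1", "100", "200"], ["chr2", "5", "50", "peak", "7", "+"]], buffer)
    print(buffer.getvalue(), end="")
```

Records that already carry a name, score, or strand keep them; only the missing columns get default values.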

qferre commented 5 years ago

Update : the fix is now part of the develop branch.