ccmbioinfo / MetaFusion

GNU Lesser General Public License v3.0
CFF file format #6

Open pintoa1-mskcc opened 1 year ago

pintoa1-mskcc commented 1 year ago

I attempted to use the convert-to-CFF helper script provided, but its output does not match the expected format, and the wiki appears to be outdated on how to use the tool. The convert_cff helper script returns a bit of a mess: sample names are cut off, columns are merged incorrectly, and it doesn't have all the columns that are "mandatory" for CFF format (everything from t_gene1 onward seems to be missing).

I made my own script to exactly match the format on the wiki: cff_format <- c("chr1","pos1","strand1","chr2","pos2","strand2","library","sample_name", "sample_type","disease","tool",'split_cnt',"span_cnt","t_gene1","t_area1", "t_gene2","t_area2")

However, when I run MetaFusion.sh, the "reformat" step changes my "strand1" and "strand2" columns to NA, and when that output is passed on to the "renamed" step, I get EMPTY files. I also get a whole bunch of errors:

    except: raise ValueError("CFF Column pos1 value " + tmp[1] + " is not a valid integer\nInvalid entry: " + cff_line)
ValueError: CFF Column pos1 value pos1 is not a valid integer
Invalid entry: chr1 pos1    NA  chr2    pos2    NA  library sample_name sample_type disease tool    split_cnt   span_cnt    t_gene1 t_area1 t_gene2 t_area2

Annotate cff, extract sequence surrounding breakpoint
2345953 annotations from /juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/reference_files/ens_known_genes.renamed.ENSG.bed loaded.
29.4318819046 sec. elapsed.
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('CKS1B', 'chr1', 'f'), ('CKS1B', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('MIR4461', 'chr5', 'f'), ('MIR4461', 'chr5', 'r')])
Warning: Input gene annotations include multiple chr, strand, or regions (5Mb away). Skipping current gene annotation.
set([('C2orf27A', 'chr2', 'f'), ('C2orf27A', 'chr2', 'r')])
[.....x500]
MetaFusion.sh: line 116: [: -eq: unary operator expected
MetaFusion.sh: line 121: [: -eq: unary operator expected
MetaFusion.sh: line 127: [: -eq: unary operator expected
Merge cff by genes and breakpoints
Traceback (most recent call last):
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 41, in <module>
    df = intersect_fusions_by_breakpoints()
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/intersect_breakpoints_and_gene_names.py", line 20, in intersect_fusions_by_breakpoints
    fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range
Error in read.table(fid_intersection_file, header = TRUE, stringsAsFactors = F) : 
  no lines available in input
Execution halted
Traceback (most recent call last):
  File "/juno/work/ccs/pintoa1/fusion_report/metafusion/MetaFusion/scripts/generate_cluster_file.py", line 93, in <module>
    fusion=pygeneann.CffFusion(lines[0])
IndexError: list index out of range

After the "reann" step, my CFF file is completely empty and MetaFusion runs on all the empty files. I have successfully run your test CFF files through MetaFusion, but I cannot get a real example working.

Would it be possible to update the wiki to explain the exact format of CFF: whether or not NAs are allowed, the data type of each column (int, string, etc.), and whether or not "disease" is important for the analysis? At the moment we are putting NAs in the disease slot.

I'm assuming that I am NOT supposed to have a header in a CFF file and that the columns MUST be in the order I specified above?

pintoa1-mskcc commented 1 year ago

So the issue was the header names. Headers are NOT allowed in CFF format, I'm assuming.
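For anyone building their own CFF file, here is a minimal Python sketch of writing the 17 columns in the wiki order as tab-separated rows with no header line. The `write_cff` helper and every value in `example_record` are hypothetical and only illustrate the layout; which values MetaFusion actually accepts (for example whether NA is allowed, or how strand should be encoded) is not confirmed anywhere in this thread.

```python
import csv
import io

# Column order as listed earlier in this thread (copied from the wiki).
CFF_COLUMNS = [
    "chr1", "pos1", "strand1", "chr2", "pos2", "strand2",
    "library", "sample_name", "sample_type", "disease", "tool",
    "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2",
]


def write_cff(records, handle):
    """Write dict records as tab-separated rows in CFF column order, without a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for rec in records:
        # A missing key raises KeyError, so incomplete records fail loudly.
        writer.writerow([rec[col] for col in CFF_COLUMNS])


# Hypothetical record: every value here is made up for illustration.
example_record = {
    "chr1": "chr1", "pos1": 1000, "strand1": "+",
    "chr2": "chr2", "pos2": 2000, "strand2": "-",
    "library": "NA", "sample_name": "SAMPLE_1", "sample_type": "NA",
    "disease": "NA", "tool": "TOOL_X",
    "split_cnt": 5, "span_cnt": 3,
    "t_gene1": "GENE_A", "t_area1": "NA",
    "t_gene2": "GENE_B", "t_area2": "NA",
}

buf = io.StringIO()
write_cff([example_record], buf)
print(buf.getvalue(), end="")
```

The point of the sketch is only that the first line of the file is already a data row, so the pos1 field of every line parses as an integer.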

mike8115 commented 1 year ago

Thanks for looking into the issue, @pintoa1-mskcc! It would appear that is correct -- none of the test data has a header. I'll dig into the codebase to look into the assumptions the script makes when parsing CFF files.

@mapostolides, I know it's been a while, but if you know the answer, that would be great! Otherwise, I'll (hopefully) find time later this week.

mapostolides commented 1 year ago

If I remember correctly, the CFF files don’t have headers. The parsing script could be modified to expect a header if the header is needed. The wiki explains the meaning of each field of the CFF. Hope that helps!
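As a rough illustration of what tolerating a header could look like, the Python sketch below drops the first line when its second field (pos1) is not an integer, which is the same condition the ValueError in the original report complains about. The `drop_header` and `is_int` helpers are hypothetical and are not part of the MetaFusion scripts.

```python
def is_int(text):
    """Return True if text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def drop_header(lines):
    """Return the lines without a leading header row, if one is present.

    A first line counts as a header when its second tab-separated
    field (pos1 in the CFF column order) is not an integer.
    """
    lines = list(lines)
    if lines:
        fields = lines[0].rstrip("\n").split("\t")
        if len(fields) > 1 and not is_int(fields[1]):
            return lines[1:]
    return lines


# Hypothetical input: a header row followed by one made-up data row.
header = "chr1\tpos1\tstrand1\tchr2\tpos2\tstrand2\n"
row = "chr1\t1000\t+\tchr2\t2000\t-\n"
cleaned = drop_header([header, row])
```

Checking pos1 rather than chr1 is deliberate: the header's first field, chr1, is also a legitimate chromosome name, so the first column alone cannot distinguish a header from data.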

pintoa1-mskcc commented 1 year ago

Awesome, it's fine that it doesn't have headers. Could it be added to the documentation that MetaFusion does NOT expect any header/colnames in the input CFF file? The wiki is a little misleading, as the CFF format tab shows the example with a header.

mike8115 commented 1 year ago

Hi all,

I looked into the issue a bit. It looks like line 82 of MetaFusion.sh isn't checking carefully enough that only valid lines are captured. With the column names from the CFF format as a header row, the first field, chr1, actually passes the regular expression /[0-9XY]/, because the expression is unanchored and matches any string containing a digit, X, or Y. In fact, alternate chromosome names (e.g. chr12 and 12) and non-existent names (e.g. 9S and 2B) can also pass.

I'll make a quick change to that to fix the initial problem and continue to look at the code to clarify the CFF format specifications.

Michael
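The matching behaviour described above can be reproduced with a short Python sketch. It assumes the shell check behaves like an unanchored search, as `re.search` does; the `STRICT` pattern is a hypothetical tighter alternative and not the change that was made in the repository.

```python
import re

# Unanchored, as quoted in the comment: any digit, X, or Y anywhere in the string.
LOOSE = re.compile(r"[0-9XY]")

# Hypothetical stricter pattern: optional "chr" prefix, then 1-22, X, or Y, and nothing else.
STRICT = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$")

names = ["chr1", "chr12", "12", "9S", "2B"]

loose_hits = [n for n in names if LOOSE.search(n)]
strict_hits = [n for n in names if STRICT.match(n)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

The anchored pattern rejects 9S and 2B, but the header field chr1 still passes because it is also a valid chromosome name. Rejecting a header row therefore needs a check on another column, such as requiring pos1 to be an integer.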