dieterich-lab / scimodom

GNU Affero General Public License v3.0

CrossMap not working as expected #76

Closed eboileau closed 4 months ago

eboileau commented 4 months ago

A clear and concise description of what the bug is.

CrossMap bed --chromid s --unmap_file /tmp/unmap.bed GRCm38_to_GRCm39.chain.gz data.bed results.bed
# our data records contain coverage, frequency, and one additional column for processing
cat data.bed
10      7443630 7443631 m6A     1000    -       7443630 7443631 0,205,0 7       27      68
10      7444731 7444732 m6A     1000    -       7444731 7444732 0,205,0 29      81      68
10      7444771 7444772 m6A     1000    -       7444771 7444772 0,205,0 14      38      68
cat results.bed
10      7393698 7393725 m6A     1000    -       7393698 7393725 0,205,0 1       27      0
10      7394799 7394880 m6A     1000    -       7394799 7394880 0,205,0 1       81      0
10      7394839 7394877 m6A     1000    -       7394839 7394877 0,205,0 1       38      0

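According to the CrossMap documentation: "For BED-like formats mentioned above, CrossMap only updates the “chrom”, “start”, “end”, and “strand” columns. All other columns will be kept AS-IS."

Yet in results.bed columns 10 and 12 are overwritten (with 1 and 0), and the intervals are no longer 1 bp wide. Maybe CrossMap believes this is a BED12 file...?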
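A quick way to test this hypothesis against the rows above. In BED12, columns 10 to 12 are blockCount, blockSizes and blockStarts. If CrossMap read our coverage, frequency and association_id that way, I would expect the output to describe a single block starting at offset 0 whose size is the old column 11. The helper name `looks_like_bed12_rewrite` is made up for this sketch, and the check only describes the observed pattern, not CrossMap's internals.

```python
# Rows copied from data.bed and results.bed above (whitespace-separated).
DATA = """\
10 7443630 7443631 m6A 1000 - 7443630 7443631 0,205,0 7 27 68
10 7444731 7444732 m6A 1000 - 7444731 7444732 0,205,0 29 81 68
10 7444771 7444772 m6A 1000 - 7444771 7444772 0,205,0 14 38 68
"""
RESULTS = """\
10 7393698 7393725 m6A 1000 - 7393698 7393725 0,205,0 1 27 0
10 7394799 7394880 m6A 1000 - 7394799 7394880 0,205,0 1 81 0
10 7394839 7394877 m6A 1000 - 7394839 7394877 0,205,0 1 38 0
"""


def parse(text):
    """Split each non-empty line into its whitespace-separated fields."""
    return [line.split() for line in text.splitlines() if line.strip()]


def looks_like_bed12_rewrite(before, after):
    """True if `after` matches a BED12 reading of `before`:
    one block (col 10 == 1) at offset 0 (col 12 == 0),
    with interval length equal to the old column 11."""
    start, end = int(after[1]), int(after[2])
    return (
        after[9] == "1"
        and after[11] == "0"
        and after[10] == before[10]
        and end - start == int(before[10])
    )


checks = [
    looks_like_bed12_rewrite(b, a)
    for b, a in zip(parse(DATA), parse(RESULTS))
]
print(checks)  # [True, True, True]
```

All three rows fit the pattern, e.g. 7393725 - 7393698 = 27, which is the frequency value of the first record.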

Output or error messages.

The 12th column is our association_id. On flush, this results in sqlalchemy.exc.IntegrityError: (MySQLdb.IntegrityError) (1364, "Field 'association_id' doesn't have a default value").

Additional context

No response

What browser were you using?

Firefox

What version of Sci-ModoM were you using?

dev

CDieterich commented 4 months ago

Somewhat unexpected based on the documentation. Did you open an issue with them? The docs say: "bed converts BED, bedGraph or other BED-like files. Only genome coordinates (i.e., the first 3 columns) will be updated. Regions mapped to multiple locations to the new assembly will be split."

eboileau commented 4 months ago

OK, the docs say:

CrossMap converts BED files with less than 12 columns to a different assembly by updating the chromosome and genome coordinates only; all other columns remain unchanged. Regions from the old assembly mapping to multiple locations to the new assembly will be split. For 12-columns BED files, all columns will be updated accordingly except the 4th column (name of bed line), 5th column (score value), and 9th column (RGB value describing the display color). 12-column BED files usually define multiple blocks (e.g., exons); if any of the exons fails to map to a new assembly, the whole BED line is skipped.

I tested again and this is consistent with what I observe.

So we're just really unlucky to have exactly 12 columns (but not a BED12 file) in this case...

This extra column (https://github.com/dieterich-lab/scimodom/blob/426d044dd54a1f08f614dc8bcbf9998dcfddf8d6/server/src/scimodom/database/models.py#L317) is absolutely required and has to go into the BED file.

CrossMap fails silently with more than 12 columns.
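Since the behaviour depends only on the number of columns, a pre-flight check could make the failure loud instead of silent. A minimal sketch, assuming whitespace-separated fields; `crossmap_bed_mode` is a hypothetical helper, and the three cases come from the quoted docs (fewer than 12, exactly 12) and from what I observed (more than 12), not from CrossMap's code.

```python
def crossmap_bed_mode(line: str) -> str:
    """Classify one BED line by column count (hypothetical helper)."""
    n_cols = len(line.split())  # assumes no whitespace inside any field
    if n_cols < 3:
        raise ValueError(f"not a BED line: only {n_cols} column(s)")
    if n_cols < 12:
        # Per the docs: only chromosome and coordinates are updated.
        return "coordinates_only"
    if n_cols == 12:
        # Per the docs: treated as BED12, block columns are rewritten.
        return "bed12"
    # Observed in this thread: CrossMap fails silently here.
    raise ValueError(f"{n_cols} columns: CrossMap was seen to fail silently")


row = "10 7443630 7443631 m6A 1000 - 7443630 7443631 0,205,0 7 27 68"
print(crossmap_bed_mode(row))  # bed12
```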

I hate to do that, but I think we have no choice but to use some trick...
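One possible trick, as a rough sketch only: pack the three extra columns into a single field so the file has 10 columns, run CrossMap, then unpack. This relies on the quoted statement that files with fewer than 12 columns keep all non-coordinate columns unchanged. The names `pack`, `unpack`, `SEP` and `N_EXTRA` are hypothetical, and the delimiter is an assumption (it must not occur in the packed values).

```python
SEP = ";"    # hypothetical delimiter, must not appear in the packed values
N_EXTRA = 3  # coverage, frequency, association_id


def pack(line: str) -> str:
    """Turn a 12-column line into 10 columns by joining the last three."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return "\t".join(fields[:9] + [SEP.join(fields[9:])])


def unpack(line: str) -> str:
    """Restore the 12-column layout from a packed 10-column line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    extras = fields[9].split(SEP)
    if len(extras) != N_EXTRA:
        raise ValueError(f"expected {N_EXTRA} packed values, got {len(extras)}")
    return "\t".join(fields[:9] + extras)


record = "\t".join(
    ["10", "7443630", "7443631", "m6A", "1000", "-",
     "7443630", "7443631", "0,205,0", "7", "27", "68"]
)
packed = pack(record)
print(packed.split("\t")[9])     # 7;27;68
print(unpack(packed) == record)  # True
```

Two caveats if we go this way. If the docs hold, columns 7 and 8 (thickStart, thickEnd) would keep their old-assembly values and would have to be reset from the new start and end. And regions that map to multiple locations are split, so one input record may come back as several lines carrying the same packed values.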

eboileau commented 4 months ago

Also, using the BED importer won't work, so we have to adjust the Data importer.